sc-best-practices-auto
Single-cell analysis best practices collection: automatic table-of-contents discovery, with full HTML/MD and code-block scraping
Install
git clone https://github.com/majiayu000/claude-skill-registry /tmp/claude-skill-registry && cp -r /tmp/claude-skill-registry/skills/development/scglue-complete ~/.claude/skills/claude-skill-registry/
Tip: Run this command in your terminal to install the skill.
name: sc-best-practices-auto description: Single-cell analysis best practices collection: automatic table-of-contents discovery, with full HTML/MD and code-block scraping
Sc-Best-Practices-Auto Skill
Comprehensive assistance with sc-best-practices-auto development, generated from official documentation.
When to Use This Skill
This skill should be triggered when:
- Working with sc-best-practices-auto
- Asking about sc-best-practices-auto features or APIs
- Implementing sc-best-practices-auto solutions
- Debugging sc-best-practices-auto code
- Learning sc-best-practices-auto best practices
Quick Reference
Common Patterns
Pattern 1: Raw data processing (Chapter 3 of the sc-best-practices book)
3. Raw data processing
Raw data processing in single-cell sequencing converts sequencing machine output (so-called lane-demultiplexed FASTQ files) into readily analyzable representations such as a count matrix. This matrix represents the estimated number of distinct molecules derived from each gene per quantified cell, sometimes categorized by the inferred splicing status of each molecule (Fig. 3.1).
Fig. 3.1 An overview of the topics discussed in this chapter. In the plot, “txome” stands for transcriptome.
The count matrix is the foundation for a wide range of scRNA-seq analyses [Zappia and Theis, 2021], including cell type identification and developmental trajectory inference. A robust and accurate count matrix is essential for reliable downstream analyses. Errors at this stage can propagate into invalid conclusions, discoveries missed because their signal was lost, or distorted signals in the data. Despite the straightforward nature of the input (FASTQ files) and the desired output (count matrix), raw data processing presents several technical challenges.
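To make the end product concrete, here is a minimal sketch of loading a finished count matrix for inspection with scanpy, assuming a 10x-style filtered matrix directory; the directory name is a placeholder for the output of any of the quantification tools discussed below.

```python
# Minimal sketch: load a cells-by-genes count matrix into an AnnData object.
# Assumes a 10x-style matrix directory (matrix.mtx.gz, features.tsv.gz,
# barcodes.tsv.gz); the directory name below is a placeholder.
import scanpy as sc

adata = sc.read_10x_mtx("filtered_feature_bc_matrix/")
print(adata)          # AnnData object: n_obs (cells) x n_vars (genes)
print(adata.X.sum())  # total number of deduplicated molecules counted
```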
In this section, we focus on key steps of raw data processing:
- Read alignment/mapping
- Cell barcode (CB) identification and correction
- Estimation of molecule counts through unique molecular identifiers (UMIs)
We also discuss the challenges and trade-offs involved in each step.
A note on preceding steps
The starting point for raw data processing is somewhat arbitrary. For this discussion, we treat lane-demultiplexed FASTQ files as the raw input. However, these files are derived from earlier steps, such as base calling and base quality estimation, which can influence downstream processing. For example, base-calling errors and index hopping [Farouni et al., 2020] can introduce inaccuracies in FASTQ data. These issues can be mitigated with computational approaches [Farouni et al., 2020] or experimental enhancements like dual indexing. Here, we do not delve into the upstream processes, but consider the FASTQ files, derived from, e.g., BCL files via appropriate tools, as the raw input under consideration.
3.1. Raw data quality control
After obtaining raw FASTQ files, it is important to evaluate the quality of the sequencing reads. A quick and effective way to perform this is by using quality control (QC) tools like FastQC. FastQC generates a detailed report for each FASTQ file, summarizing key metrics such as quality scores, base content, and other statistics that help identify potential issues arising from library preparation or sequencing. While many modern single-cell data processing tools include some built-in quality checks, such as evaluating the N content of sequences or the fraction of mapped reads, it is still good practice to run an independent QC check.
For readers interested in what a typical FastQC report looks like, the following toggle content demonstrates the modules of a FastQC report, using example reports for both high-quality and low-quality Illumina data provided by the FastQC manual webpage, along with the tutorials and descriptions from the RTSF at MSU, the HBC training program, and the QC Fail website. Although these tutorials are not explicitly made for single-cell data, many of the results are still relevant for single-cell data, with a few caveats described below. In the toggle section, all graphs, except where specifically mentioned, are taken from the example reports on the FastQC manual webpage.
It is important to note that many QC metrics in FastQC reports are most meaningful only for biological reads, i.e., those derived from gene transcripts. For single-cell datasets, such as 10x Chromium v2 and v3, this typically corresponds to read 2 (the files containing R2 in their filename), which contains the transcript-derived sequences. In contrast, technical reads, which contain barcode and UMI sequences, often do not exhibit biologically typical sequence or GC content. However, certain metrics, like the fraction of N base calls, are still relevant for all reads.
Example FastQC Reports and Tutorials
0. Summary
The summary panel on the left side of the HTML report displays the module names along with symbols that provide a quick assessment of the module results. However, FastQC applies uniform thresholds across all sequencing platforms and biological materials. As a result, warnings (orange exclamation marks) or failures (red crosses) may appear for high-quality data, while questionable data might receive passes (green ticks). Therefore, each module should be carefully reviewed before drawing conclusions about data quality.
Fig. 3.2 The summary panel of a bad example.
1. Basic statistics
The basic statistics module provides an overview of key information and statistics for the input FASTQ file, including the filename, total number of sequences, number of poor-quality sequences, sequence length, and the overall GC content (%GC) across all bases in all sequences. High-quality single-cell data typically have very few poor-quality sequences and exhibit a uniform sequence length. Additionally, the GC content should align with the expected GC content of the genome or transcriptome of the sequenced species.
Fig. 3.3 A good basic statistics report example.
2. Per base sequence quality
The per-base sequence quality view displays a box-and-whisker plot for each position in the read. The x-axis represents the positions within the read, while the y-axis shows the quality scores. For high-quality single-cell data, the yellow boxes, representing the interquartile range of quality scores, should fall within the green area (indicating good quality calls). Similarly, the whiskers, which represent the 10th and 90th percentiles of the distribution, should also remain within the green area. It is common to observe a gradual drop in quality scores along the length of the read, with some base calls at the last positions falling into the orange area (reasonable quality) due to a decreasing signal-to-noise ratio, a characteristic of sequencing-by-synthesis methods. However, the boxes should not extend into the red area (poor quality calls). If poor-quality calls are observed, quality trimming may be necessary. A more detailed explanation of sequencing error profiles can be found in the HBC training program.
Fig. 3.4 A good (left) and a bad (right) per-read sequence quality graph.
3. Per tile sequence quality
For an Illumina library, the per-tile sequence quality plot highlights deviations from the average quality for reads across each flowcell tile (the miniature imaging areas of the flowcell). The plot uses a color gradient to represent deviations, where warmer colors indicate larger deviations. High-quality data typically display a uniform blue color across the plot, indicating consistent quality across all tiles of the flowcell. If warm colors appear in certain areas, it suggests that only part of the flowcell experienced poor quality. This could result from transient issues during sequencing, such as bubbles passing through the flowcell or smudges and debris within the flowcell lane. For further investigation, consult resources like QC Fail and the common reasons for warnings provided in the FastQC manual.
Fig. 3.5 A good (left) and a bad (right) per tile sequence quality view.
4. Per sequence quality scores
The per-sequence quality score plot displays the distribution of average quality scores for each read in the file. The x-axis represents the average quality scores, while the y-axis shows the frequency of each score. For high-quality data, the plot should have a single peak near the high-quality end of the scale. If additional peaks appear, it may indicate a subset of reads with quality issues.
Fig. 3.6 A good (left) and a bad (right) per sequence quality score plot.
5. Per base sequence content
The per-base sequence content plot shows the percentage of each nucleotide (A, T, G, and C) called at each base position across all reads in the file. For single-cell data, it is common to observe fluctuations at the start of the reads. This occurs because the initial bases represent the sequence of the priming sites, which are often not perfectly random. This is a frequent occurrence in RNA-seq libraries, even though FastQC may flag it with a warning or failure, as noted on the QC Fail website.
Fig. 3.7 A good (left) and a bad (right) per base sequence content plot.
6. Per sequence GC content
The per-sequence GC content plot displays the GC content distribution across all reads (in red) compared to a theoretical distribution (in blue). The central peak of the observed distribution should align with the overall GC content of the transcriptome. However, the observed distribution may appear wider or narrower than the theoretical one due to differences between the transcriptome’s GC content and the genome’s expected GC distribution. Such variations are common and may trigger a warning or failure in FastQC, even if the data are acceptable. A complex or irregular distribution in this plot, however, often indicates contamination in the library. It is also important to note that interpreting GC content in transcriptomics can be challenging. The expected GC distribution depends not only on the sequence composition of the transcriptome but also on gene expression levels in the sample, which are typically unknown beforehand. As a result, some deviation from the theoretical distribution is not unusual in RNA-seq data.
Fig. 3.8 A good (left) and a bad (right) per sequence GC content plot. The plot on the left is from the RTSF at MSU. The plot on the right is taken from the HBC training program.
7. Per base N content
The per-base N content plot displays the percentage of bases at each position that were called as N, indicating that the sequencer lacked sufficient confidence to assign a specific nucleotide. In a high-quality library, the N content should remain consistently at or near zero across the entire length of the reads. Any noticeable non-zero N content may indicate issues with sequencing quality or library preparation.
Fig. 3.9 A good (left) and a bad (right) per base N content plot.
8. Sequence length distribution
The sequence length distribution graph displays the distribution of read lengths across all sequences in the file. For most single-cell sequencing chemistries, all reads are expected to have the same length, resulting in a single peak in the graph. However, if quality trimming was applied before the quality assessment, some variation in read lengths may be observed. Small differences in read lengths due to trimming are normal and should not be a cause for concern if expected.
Fig. 3.10 A good (left) and a bad (right) sequence length distribution plot.
9. Sequence duplication levels
The sequence duplication level plot illustrates the distribution of duplication levels for read sequences, represented by the blue line, both before and after deduplication. In single-cell platforms, multiple rounds of PCR are typically required, and highly expressed genes naturally produce a large number of transcripts. Additionally, since FastQC is not UMI-aware (i.e., it does not account for unique molecular identifiers), it is common for a small subset of sequences to show high duplication levels. While this may trigger a warning or failure in this module, it does not necessarily indicate a quality issue with the data. However, the majority of sequences should still exhibit low duplication levels, reflecting a diverse and well-prepared library.
Fig. 3.11 A good (left) and a bad (right) per sequence duplication levels plot.
10. Overrepresented sequences
The overrepresented sequences module identifies read sequences that constitute more than 0.1% of the total reads. In single-cell sequencing, some overrepresented sequences may arise from highly expressed genes amplified during PCR. However, the majority of sequences should not be overrepresented. If the source of an overrepresented sequence is identified (i.e., not listed as “No Hit”), it could indicate potential contamination in the library from the corresponding source. Such cases warrant further investigation to ensure data quality.
Fig. 3.12 An overrepresented sequence table.
11. Adapter content
The adapter content module displays the cumulative percentage of reads containing adapter sequences at each base position. High levels of adapter sequences indicate incomplete removal of adapters during library preparation, which can interfere with downstream analyses. Ideally, no significant adapter content should be present in the data. If adapter sequences are abundant, additional trimming may be necessary to improve data quality.
Fig. 3.13 A good (left) and a bad (right) adapter content plot. The plot on the right is from the QC Fail website.
Multiple FastQC reports can be combined into a single report using the tool MultiQC.
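As an illustration of what tools like FastQC compute, the following standard-library sketch derives one such metric, the mean Phred quality per base position, from a gzipped FASTQ file; the filename is a placeholder.

```python
# Self-contained sketch (standard library only) of one FastQC-style metric:
# mean Phred quality per base position, from a gzipped FASTQ file.
import gzip

sums, counts = [], []
with gzip.open("reads_R2.fastq.gz", "rt") as fh:  # placeholder filename
    for i, line in enumerate(fh):
        if i % 4 == 3:  # every 4th line of a FASTQ record is the quality string
            qual = line.rstrip("\n")
            if len(qual) > len(sums):
                pad = len(qual) - len(sums)
                sums.extend([0] * pad)
                counts.extend([0] * pad)
            for pos, ch in enumerate(qual):
                sums[pos] += ord(ch) - 33  # Phred+33 encoding
                counts[pos] += 1

for pos, (s, n) in enumerate(zip(sums, counts)):
    print(f"position {pos + 1}: mean Q = {s / n:.1f}")
```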
3.2. Alignment and mapping
Mapping or alignment is a critical step in single-cell raw data processing. It involves determining the potential loci of origin for each sequenced fragment, such as the genomic or transcriptomic locations that closely match the read sequence. This step is essential for correctly assigning reads to their source regions. In single-cell sequencing protocols, the raw sequence files typically include:
- Cell barcodes (CBs): unique identifiers for individual cells.
- Unique molecular identifiers (UMIs): tags that distinguish individual molecules to account for amplification bias.
- Raw cDNA sequences: the actual read sequences generated from the molecules.
As the first step (Fig. 3.1), accurate mapping or alignment is crucial for reliable downstream analyses. Errors during this step, such as incorrect mapping of reads to transcripts or genes, can result in inaccurate or misleading count matrices. While mapping read sequences to reference sequences far predates the development of scRNA-seq, the sheer scale of modern scRNA-seq datasets, often involving hundreds of millions to billions of reads, makes this step particularly computationally intensive. Many existing RNA-seq aligners are protocol-agnostic and do not inherently account for features specific to scRNA-seq, such as cell barcodes, UMIs, or their positions and lengths. As a result, additional tools are often required for steps like demultiplexing and UMI resolution [Smith et al., 2017].
To address the challenges of aligning and mapping scRNA-seq data, several specialized tools have been developed that handle the additional processing requirements automatically or internally. These tools include:
- Cell Ranger (commercial software from 10x Genomics) [Zheng et al., 2017]
- zUMIs [Parekh et al., 2018]
- alevin [Srivastava et al., 2019]
- RainDrop [Niebler et al., 2020]
- kallisto|bustools [Melsted et al., 2021]
- STARsolo [Kaminow et al., 2021]
- alevin-fry [He et al., 2022]
These tools provide specialized capabilities for aligning scRNA-seq reads, parsing technical read content (e.g., cell barcodes and UMIs), demultiplexing, and UMI resolution. Although they offer simplified user interfaces, their internal methodologies differ significantly. Some tools generate traditional intermediate files, such as BAM files, which are processed further, while others operate entirely in memory or use compact intermediate representations to minimize input/output operations and reduce computational overhead. While these tools vary in their specific algorithms, data structures, and trade-offs in time and space complexity, their approaches can generally be categorized along two axes:
- the type of mapping they perform, and
- the type of reference sequence against which they map reads.
3.2.1. Types of mapping
We focus on three main types of mapping algorithms commonly used for mapping sc/snRNA-seq data: spliced alignment, contiguous alignment, and variations of lightweight mapping. First, we distinguish between alignment-based approaches and lightweight mapping-based approaches (Fig. 3.14). Alignment-based methods use various heuristics to identify potential loci from which reads may originate and then score the best nucleotide-level alignment between the read and reference, typically using dynamic programming algorithms. Global alignment aligns the entirety of the query and reference sequences, while local alignment focuses on aligning subsequences. Short-read alignment often employs a semi-global approach, also known as “fitting” alignment, where most of the query aligns to a substring of the reference. Additionally, “soft-clipping” may be used to reduce penalties for mismatches, insertions, or deletions at the start or end of the read, achieved through “extension” alignment. While these variations modify the rules of the dynamic programming recurrence and traceback, they do not fundamentally alter its overall complexity.
Several sophisticated modifications and heuristics have been developed to enhance the practical efficiency of aligning genomic sequencing reads. For example, banded alignment [Chao et al., 1992] is a popular heuristic used by many tools to avoid computing large portions of the dynamic programming table when alignment scores below a threshold are not of interest. Other heuristics, like X-drop [Zhang et al., 2000] and Z-drop [Li, 2018], efficiently prune unpromising alignments early in the process. Recent advances, such as wavefront alignment [Marco-Sola et al., 2020, Marco-Sola et al., 2022], enable the determination of optimal alignments in significantly reduced time and space, particularly when high-scoring alignments are present. Additionally, much work has focused on optimizing data layout and computation to leverage instruction-level parallelism [Farrar, 2007, Rognes and Seeberg, 2000, Wozniak, 1997], and on expressing dynamic programming recurrences in ways that facilitate data parallelism and vectorization, such as through difference encoding [Suzuki and Kasahara, 2018]. Most widely-used alignment tools incorporate these highly optimized, vectorized implementations.
In addition to the alignment score, the backtrace of the actual alignment that produces this score is often encoded as a CIGAR string (short for “Concise Idiosyncratic Gapped Alignment Report”). This alphanumeric representation is typically stored in the SAM or BAM file output. For example, the CIGAR string 3M2D4M indicates that the alignment has three matches or mismatches, followed by a deletion of length two (representing bases present in the reference but not the read), and then four more matches or mismatches. Extended CIGAR strings can provide additional details, such as distinguishing between matches, mismatches, or insertions. For instance, 3=2D2=2X encodes the same alignment as the previous example but specifies that the three bases before the deletion are matches, followed by two matched bases and two mismatched bases after the deletion. A detailed description of the CIGAR string format can be found in the SAMtools manual or the SAM wiki page of UMICH.
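The CIGAR grammar is easy to decode programmatically. The sketch below parses the example strings from this section and reports how many reference bases each alignment spans; the set of reference-consuming operations follows the SAM specification.

```python
# Sketch: decode a CIGAR string such as "3M2D4M" into (length, operation)
# pairs and compute the number of reference bases the alignment consumes.
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REF = set("MDN=X")  # operations that consume reference bases

def parse_cigar(cigar):
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    ref_len = sum(n for n, op in ops if op in CONSUMES_REF)
    return ops, ref_len

ops, ref_len = parse_cigar("3M2D4M")
print(ops)      # [(3, 'M'), (2, 'D'), (4, 'M')]
print(ref_len)  # 9 reference bases spanned

ops, ref_len = parse_cigar("3=2D2=2X")  # extended CIGAR, same alignment
print(ops, ref_len)                     # also spans 9 reference bases
```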
Alignment-based approaches, though computationally expensive, provide a quality score for each potential mapping of a read. This score allows them to distinguish between high-quality alignments and low-complexity or “spurious” matches between the read and reference. These approaches include traditional “full-alignment” methods, such as those implemented in tools like STAR [Dobin et al., 2013] and STARsolo [Kaminow et al., 2021], as well as selective-alignment methods, like those in salmon [Srivastava et al., 2020] and alevin [Srivastava et al., 2019], which score mappings but skip the computation of the optimal alignment’s backtrace.
Fig. 3.14 An abstract overview of the alignment-based method and lightweight mapping-based method.
Alignment-based approaches can be categorized into spliced-alignment and contiguous-alignment methods.
Spliced-alignment methods
Spliced-alignment methods allow a sequence read to align across multiple distinct segments of a reference, allowing potentially large gaps between aligned regions. These approaches are particularly useful for aligning RNA-seq reads to the genome, where reads may span splice junctions. In such cases, a contiguous sequence in the read may be separated by intron and exon subsequences in the reference, potentially spanning kilobases of sequence. Spliced alignment is especially challenging when only a small portion of a read overlaps a splice junction, as limited sequence information is available to accurately place the overhanging segment.
Contiguous-alignment methods
Contiguous-alignment methods require a continuous substring of the reference to align well with the read. While small insertions and deletions may be tolerated, large gaps, such as those in spliced alignments, are generally not allowed.
Alignment-based methods, whether spliced or contiguous, can be distinguished from lightweight-mapping methods, which include approaches like pseudoalignment [Bray et al., 2016], quasi-mapping [Srivastava et al., 2016], and pseudoalignment with structural constraints [He et al., 2022]. Lightweight-mapping methods achieve significantly higher speed. However, they do not provide easily-interpretable score-based assessments of match quality, making it more difficult to assess alignment confidence.
3.2.2. Mapping against different reference sequences
In addition to selecting a mapping algorithm, choices can also be made regarding the reference sequence against which the reads are mapped. There are three main categories of reference sequences:
- full reference genome (typically annotated)
- annotated transcriptome
- augmented transcriptome
Currently, not all combinations of mapping algorithms and reference sequences are possible. For instance, lightweight-mapping algorithms do not yet support spliced mapping of reads against a reference genome.
3.2.2.1. Mapping to the full genome
The first type of reference used for mapping is the entire genome of the target organism, typically with annotated transcripts considered during mapping. Tools such as zUMIs [Parekh et al., 2018], Cell Ranger [Zheng et al., 2017], and STARsolo [Kaminow et al., 2021] follow this approach.
Since many reads originate from spliced transcripts, this method requires a splice-aware alignment algorithm capable of splitting alignments across one or more splice junctions. A key advantage of this approach is that it accounts for reads arising from any location in the genome, not just those from annotated transcripts. Additionally, because a genome-wide index is constructed, there is minimal additional cost in reporting not only reads that map to known spliced transcripts but also those that overlap introns or align within non-coding regions, making this method equally effective for single-cell and single-nucleus data. Another benefit is that even reads mapping outside annotated transcripts, exons, or introns can still be accounted for, enabling post hoc augmentation of the quantified loci.
3.2.2.2. Mapping to the spliced transcriptome
To reduce the computational overhead of spliced alignment to a genome, a widely adopted alternative is to use only the annotated transcript sequences as the reference. Since most single-cell experiments are conducted on model organisms like mouse or human, which have well-annotated transcriptomes, transcriptome-based quantification can achieve similar read coverage to genome-based methods. Compared to the genome, transcriptome sequences are much smaller, significantly reducing the computational resources needed for mapping. Additionally, because splicing patterns are already represented in transcript sequences, this approach eliminates the need for complex spliced alignment. Instead, reads can be mapped using contiguous alignments or mappings, making both alignment-based and lightweight-mapping techniques suitable for transcriptome references.
While these approaches significantly reduce the memory and time required for alignment and mapping, they fail to capture reads that arise from outside the spliced transcriptome. As a result, they are not suitable for processing single-nucleus data. Even in single-cell experiments, a substantial fraction of reads may arise from regions outside the spliced transcriptome, and increasing evidence suggests that incorporating these reads into downstream analyses can be beneficial [Pool et al., 2022, 10x Genomics, 2021]. Additionally, when paired with lightweight-mapping methods, short sequences shared between the spliced transcriptome and the actual genomic regions that generated a read can lead to spurious mappings. This, in turn, may result in misleading and even biologically implausible gene expression estimates [Brüning et al., 2022, He et al., 2022, Kaminow et al., 2021].
3.2.2.3. Mapping to an augmented transcriptome
To account for reads originating outside spliced transcripts, the spliced transcript sequences can be augmented with additional reference sequences, such as full-length unspliced transcripts or excised intronic sequences. This enables better, faster, and more memory-efficient mapping than full-genome alignment, while still capturing many reads that would otherwise be missed.
More reads can be confidently assigned than when using only the spliced transcriptome, and when combined with lightweight-mapping approaches, spurious mappings can be significantly reduced [He et al., 2022]. Augmented transcriptomes are widely used in methods that do not map to the full genome, particularly for single-nucleus data processing and RNA velocity analysis [Soneson et al., 2021] (see RNA velocity). These augmented references can be constructed for all common methods that do not rely on spliced alignment to the full genome [He et al., 2022, Melsted et al., 2021, Srivastava et al., 2019].
3.3. Cell barcode correction
Droplet-based single-cell segregation systems, such as those provided by 10x Genomics, have become an important tool for studying the causes and consequences of cellular heterogeneity. In these systems, the RNA material of each captured cell is extracted within a water-based droplet encapsulation along with a barcoded bead. These beads tag the RNA content of individual cells with unique oligonucleotides, called cell barcodes (CBs), which are later sequenced along with the fragments of the cDNAs reverse-transcribed from the RNA content. The beads contain high-diversity DNA barcodes, allowing for parallel barcoding of a cell’s molecular content and in silico demultiplexing of sequencing reads into individual cellular bins.
A note on alignment orientation
Depending on the sample chemistry and user-defined processing options, not all sequenced fragments that align to the reference are necessarily considered for quantification and barcode correction. One commonly-applied criterion for filtering is alignment orientation. Specifically, certain chemistries specify protocols such that the aligned reads should only derive from (i.e., map back to) the underlying transcripts in a specific orientation. For example, in 10x Genomics 3’ Chromium chemistries, we expect the biological read to align to the underlying transcript’s forward strand, though anti-sense reads do exist [10x Genomics, 2021]. As a result, reads mapped in the reverse-complement orientation to the reference sequences may be ignored or filtered out based on user-defined settings. If a chemistry follows such a so-called “stranded” protocol, this should be documented.
3.3.1. Types of errors in barcoding
The tag, sequence, and demultiplex approach used for single-cell profiling is generally effective. However, in droplet-based libraries, the number of observed cell barcodes (CBs) can differ significantly, often by several fold, from the number of originally encapsulated cells. This discrepancy arises from several key sources of error:
- Doublets/multiplets: a single barcode may be associated with multiple cells, leading to an undercounting of cells.
- Empty droplets: some droplets contain no encapsulated cell, yet ambient RNA can become tagged with a barcode and sequenced, resulting in an overcounting of cells.
- Sequence errors: errors introduced during PCR amplification or sequencing can distort barcode counts, contributing to both under- and over-counting.
To address these issues, computational tools for demultiplexing RNA-seq reads into cell-specific bins use various diagnostic indicators to filter out artefactual or low-quality data.
Numerous methods exist for removing ambient RNA contamination [Lun et al., 2019, Muskovic and Powell, 2021, Young and Behjati, 2020], detecting doublets [Bais and Kostka, 2019, DePasquale et al., 2019, McGinnis et al., 2019, Wolock et al., 2019], and correcting cell barcode errors based on nucleotide sequence similarity. Several common strategies are used for cell barcode identification and correction.
Correction against a known list of potential barcodes: Certain chemistries, such as 10x Chromium, draw CBs from a known pool of potential barcode sequences. Thus, the set of barcodes observed in any sample is expected to be a subset of this known list, often called a “whitelist” or permit list. In this case, the standard approach assumes that any barcode matching an entry in the known list is correct, and that any barcode not in the list is corrected by finding the closest match from the permit list, typically using Hamming distance or edit distance. This strategy allows for efficient barcode correction but has limitations. If a corrupted barcode closely resembles multiple barcodes in the permit list, its correction becomes ambiguous. For example, for a barcode taken from the 10x Chromium v3 permit list and mutated at a single position to a barcode not in the list, there is a $\sim 81\%$ probability that it sits at Hamming distance $1$ from two or more barcodes in the permit list. The probability of such collisions can be reduced by correcting only against barcodes from the known permit list that themselves occur exactly in the given sample (or even only those that occur exactly in the given sample above some nominal frequency threshold). Information such as the base quality at the “corrected” position can also be used to break ties in the case of ambiguous corrections. Yet, as the number of assayed cells increases, insufficient sequence diversity in the set of potential cell barcodes increases the frequency of ambiguous corrections, and reads tagged with barcodes having ambiguous corrections are most commonly discarded.
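A minimal sketch of this known-list strategy, with unambiguous single-mismatch rescue and discarding of ambiguous corrections; the toy permit list is purely illustrative.

```python
# Sketch: correct a barcode against a permit list. Exact matches pass
# through; otherwise every Hamming-distance-1 neighbor is generated and
# the barcode is rescued only if exactly one neighbor is in the list.
def correct_barcode(bc, permit, alphabet="ACGT"):
    if bc in permit:
        return bc
    hits = set()
    for pos in range(len(bc)):
        for base in alphabet:
            if base != bc[pos]:
                cand = bc[:pos] + base + bc[pos + 1:]
                if cand in permit:
                    hits.add(cand)
    if len(hits) == 1:  # unambiguous correction
        return hits.pop()
    return None         # ambiguous (>1 hit) or uncorrectable: discard

permit = {"AAAC", "AAGG", "CCGT"}       # toy permit list
print(correct_barcode("AAAC", permit))  # exact match -> "AAAC"
print(correct_barcode("AAAG", permit))  # matches AAAC and AAGG -> None
print(correct_barcode("CCGA", permit))  # unique neighbor -> "CCGT"
```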
Knee- or elbow-based methods: If a set of potential barcodes is unknown, or if it is known but one wishes to correct directly from the observed data without consulting an external list, one can exploit the observation that high-quality barcodes are those associated with the highest number of reads in the sample. To do so, one constructs a cumulative frequency plot in which barcodes are sorted in descending order by the number of distinct reads or UMIs with which they are associated. Often, this ranked cumulative frequency plot contains a “knee” or “elbow”, an inflection point that separates frequently occurring barcodes from infrequent (and therefore likely erroneous) barcodes. Many methods exist for attempting to identify such an inflection point [He et al., 2022, Lun et al., 2019, Smith et al., 2017] as a likely point of discrimination between properly captured cells and empty droplets (a toy sketch of one such heuristic appears after this list of strategies). Subsequently, the set of barcodes that appear “above” the knee can be treated as a permit list against which the rest of the barcodes are corrected, as in the first method above. Such an approach is flexible, as it can be applied both in chemistries that have an external permit list and in those that do not. Parameters of the knee-finding algorithms can be altered to yield more or less restrictive selected barcode sets. Yet, such an approach can have certain drawbacks, like a tendency to be overly conservative, and it sometimes fails to work robustly in samples where no clear knee is present.
Filtering and correction based on an expected cell count: When barcode frequency distributions lack a clear knee or show bimodal patterns due to technical artifacts, barcode correction can be guided by a user-provided expected cell count. In such an approach, the user provides an estimate of the expected number of assayed cells. The barcodes are ordered by descending frequency, the frequency $f$ at a robust quantile index near the expected cell count is obtained, and all barcodes having a frequency within a small constant fraction of $f$ (e.g., $\ge \frac{f}{10}$) are considered valid. Again, the remaining barcodes are corrected against this valid list by attempting to correct each uniquely to one of the valid barcodes based on sequence similarity.
Filtering based on a forced number of valid cells: The simplest approach, although potentially problematic, is for the user to manually specify the number of valid barcodes by choosing an index in the sorted barcode frequency list. All barcodes above this threshold are considered valid, and the remaining barcodes are corrected against this list using standard similarity-based correction. While this guarantees the selection of at least n cells, it assumes that the chosen threshold accurately reflects the number of real cells, and it is only reasonable if the user has good reason to believe that the threshold frequency should be set around the provided index.
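The following toy sketch illustrates one simple knee-finding heuristic: take the point on the log-log rank/frequency curve farthest from the chord joining its endpoints. Production tools use considerably more robust procedures; the counts below are fabricated.

```python
# Toy knee-finding heuristic on the ranked barcode frequency curve.
import numpy as np

def knee_index(counts):
    """counts: per-barcode read/UMI counts, sorted in descending order."""
    y = np.log10(np.asarray(counts, dtype=float))
    x = np.log10(np.arange(1, len(y) + 1))
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    # Perpendicular distance of each point from the chord between endpoints.
    dist = np.abs(dx * (y - y[0]) - dy * (x - x[0])) / np.hypot(dx, dy)
    return int(np.argmax(dist))

counts = sorted([5000] * 90 + [4000] * 10 + [20] * 900, reverse=True)
k = knee_index(counts)
print(f"knee at rank {k + 1}; barcodes above it form the candidate permit list")
```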
3.4. UMI resolution
After cell barcode (CB) correction, reads have either been discarded or assigned to a corrected CB. Subsequently, we wish to quantify the abundance of each gene within each corrected CB. Because of the amplification bias discussed in Transcript quantification, reads must be deduplicated based upon their UMI to assess the true count of sampled molecules. Additionally, several other complicating factors present challenges when attempting to perform this estimation.
The UMI deduplication step aims to identify the set of reads and UMIs derived from each original, pre-PCR molecule in each cell captured and sequenced in the experiment. The result of this process is to allocate a molecule count to each gene in each cell, which is subsequently used in the downstream analysis as the raw expression estimate for this gene. We refer to this process of examining the collection of observed UMIs and their associated mapped reads, and attempting to infer the original number of molecules arising from each gene, as UMI resolution.
To simplify the explanation, reads that map to a reference (e.g., a genomic locus of a gene) are referred to as the reads of that reference, and their UMI tags are called the UMIs of that reference. The set of reads associated with a specific UMI is referred to as the reads of that UMI. A read can be tagged by only one UMI but may belong to multiple references if it maps to more than one. Additionally, since molecule barcoding in scRNA-seq is typically isolated and independent for each cell (aside from the previously discussed challenges in resolving cell barcodes), UMI resolution will be explained for a single cell without loss of generality; the same procedure is generally applied to all cells independently.
3.4.1. The need for UMI resolution
In the ideal case, where the correct (unaltered) UMIs tag reads, the reads of each UMI uniquely map to a common reference gene, and there is a bijection between UMIs and pre-PCR molecules. Consequently, the UMI deduplication procedure is conceptually straightforward: the reads of a UMI are the PCR duplicates of a single pre-PCR molecule, and the number of captured and sequenced molecules of each gene is the number of distinct UMIs observed for that gene. However, problems encountered in practice make these simple rules insufficient for identifying the gene origin of UMIs in general and necessitate the development of more sophisticated models:
- Errors in UMIs: These occur when the sequenced UMI tags of reads contain errors introduced during PCR or the sequencing process. Common UMI errors include nucleotide substitutions during PCR and read errors during sequencing. Failing to address such UMI errors can inflate the estimated number of molecules [Smith et al., 2017, Ziegenhain et al., 2022].
- Multimapping: This issue arises when a read or UMI belongs to multiple references (e.g., multi-gene reads/UMIs). This happens when different reads of a UMI map to different genes, when a read maps to multiple genes, or both. The consequence is that the gene origin of multi-gene reads/UMIs is ambiguous, which results in uncertainty about the sampled pre-PCR molecule count of those genes. Simply discarding multi-gene reads/UMIs can lead to a loss of data or a biased estimate among genes that tend to produce multimapping reads, such as sequence-similar gene families [Srivastava et al., 2019].
A note on UMI errors
UMI errors, especially those due to nucleotide substitutions and miscalls, are prevalent in single-cell experiments. Smith et al. [2017] establish that the average number of bases different (edit distance) between the observed UMI sequences in the tested single-cell experiments is lower than that between randomly sampled UMI sequences, and that the enrichment of low edit distances is well correlated with the degree of PCR amplification. Multimapping also exists in single-cell data and, depending upon the gene being considered, can occur at a non-trivial rate. Srivastava et al. [2019] show that discarding multimapping reads can negatively bias the predicted molecule counts.
There exist other challenges that we do not focus upon here, such as “convergent” and “divergent” UMI collisions. We consider the case where the same UMI is used to tag two different pre-PCR molecules arising from the same gene, in the same cell, a convergent collision. When two or more distinct UMIs arise from the same pre-PCR molecule, e.g., due to the sampling of multiple priming sites from this molecule, we consider this a divergent collision. Convergent UMI collisions are expected to be rare, and their effect therefore typically small; further, transcript-level mapping information can sometimes be used to resolve such collisions [Srivastava et al., 2019]. Divergent UMI collisions occur primarily among introns of unspliced transcripts [10x Genomics, 2021], and approaches to addressing the issues they raise are an area of active research [Gorin and Pachter, 2021, 10x Genomics, 2021].
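The enrichment of small UMI distances described by Smith et al. [2017] can be checked directly. This toy sketch compares the pairwise Hamming distance histogram of a set of observed UMIs (here fabricated to mimic PCR errors of one parent UMI) against randomly drawn UMIs of the same length.

```python
# Toy diagnostic: pairwise Hamming distances among observed UMIs at one
# locus versus randomly drawn UMIs. All UMI sequences here are fabricated.
import itertools
import random
from collections import Counter

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def distance_histogram(umis):
    return Counter(hamming(a, b) for a, b in itertools.combinations(umis, 2))

observed = ["ACGTAC", "ACGTAA", "ACGTAT", "TTGCAG"]  # errors of one parent UMI
random.seed(0)
rand = ["".join(random.choice("ACGT") for _ in range(6)) for _ in range(4)]

print("observed:", sorted(distance_histogram(observed).items()))
print("random:  ", sorted(distance_histogram(rand).items()))
```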
Given that the use of UMIs is nearly ubiquitous in high-throughput scRNA-seq protocols, and given that addressing these errors improves the estimation of gene abundances, much attention has been paid to the problem of UMI resolution in recent literature [Bose et al., 2015, He et al., 2022, Islam et al., 2013, Kaminow et al., 2021, Macosko et al., 2015, Melsted et al., 2021, Orabi et al., 2018, Parekh et al., 2018, Smith et al., 2017, Srivastava et al., 2019, Tsagiopoulou et al., 2021].
Graph-based UMI resolution
As a result of the problems that arise when attempting to resolve UMIs, many methods have been developed to address the problem of UMI resolution. While there are a host of different approaches, we will focus on a framework for representing problem instances, modified from a framework initially proposed by Smith et al. [2017], that relies upon the notion of a UMI graph. Each connected component of this graph represents a sub-problem wherein certain subsets of UMIs are collapsed (i.e., resolved as evidence of the same pre-PCR molecule). Many popular UMI resolution approaches can be interpreted in this framework by simply modifying how the graph is refined and how the collapse or resolution procedure over this graph is carried out. (Other UMI resolution approaches exist, for example, the reference-free model [Tsagiopoulou et al., 2021] and the method of moments [Melsted et al., 2021], but they may not be easily represented in this framework and are not discussed in further detail here.)
In the context of single-cell data, a UMI graph $G(V, E)$ is a directed graph with a node set $V$ and an edge set $E$. Each node $v_i \in V$ represents an equivalence class (EC) of reads, and the edge set $E$ encodes the relationships between the ECs. The equivalence relation $\sim_r$ defined on reads is based on their UMI and mapping information: we say reads $r_x$ and $r_y$ are equivalent, $r_x \sim_r r_y$, if and only if they have identical UMI tags and map to the same set of references. UMI resolution approaches may define a “reference” as a genomic locus [Smith et al., 2017], a transcript [He et al., 2022, Srivastava et al., 2019], or a gene [Kaminow et al., 2021, Zheng et al., 2017]. In the UMI graph framework, a UMI resolution approach can be divided into three major steps: defining nodes, defining adjacency relationships, and resolving components. Each of these steps has different options that can be modularly composed by different approaches. Additionally, these steps may sometimes be preceded (and/or followed) by filtering steps designed to discard or heuristically assign (by modifying the set of reference mappings reported) reads and UMIs exhibiting certain types of mapping ambiguity.
Defining nodes
As described above, a node $v_i \in V$ is an equivalence class of reads. Therefore, $V$ can be defined based on the full or filtered set of mapped reads and their associated uncorrected UMIs. All reads that satisfy the equivalence relation $\sim_r$ based on their reference set and UMI tag are associated with the same vertex $v \in V$. An EC is a multi-gene EC if its UMI is a multi-gene UMI.
Some approaches will avoid the creation of such ECs by filtering or heuristically assigning reads prior to node creation, while other approaches will retain and process these ambiguous vertices and attempt to resolve their gene origin via parsimony, probabilistic assignment, or a related rule or model [He et al., 2022, Kaminow et al., 2021, Srivastava et al., 2019].
Defining the adjacency relationship
After creating the node set $V$ of a UMI graph, the adjacency of nodes in $V$ is defined based on the distance, typically the Hamming or edit distance, between their UMI sequences and, optionally, the content of their associated reference sets. We define the following functions on a node $v_i \in V$:
- $u(v_i)$ is the UMI tag of $v_i$.
- $c(v_i) = |v_i|$ is the cardinality of $v_i$, i.e., the number of reads associated with $v_i$ that are equivalent under $\sim_r$.
- $m(v_i)$ is the reference set encoded in the mapping information for $v_i$.
- $D(v_i, v_j)$ is the distance between $u(v_i)$ and $u(v_j)$, where $v_j \in V$.
Given these definitions, any two nodes $v_i, v_j \in V$ are incident with a bi-directed edge if and only if $m(v_i) \cap m(v_j) \ne \emptyset$ and $D(v_i, v_j) \le \theta$, where $\theta$ is a distance threshold, often set as $\theta = 1$ [Kaminow et al., 2021, Smith et al., 2017, Srivastava et al., 2019]. Additionally, the bi-directed edge may be replaced by a directed edge incident from $v_i$ to $v_j$ if $c(v_i) \ge 2c(v_j) - 1$, or vice versa [Smith et al., 2017, Srivastava et al., 2019]. Though these edge definitions are among the most common, others are possible, so long as they are completely defined by the $u$, $c$, $m$, and $D$ functions. With $V$ and $E$ in hand, the UMI graph $G = (V, E)$ is now defined.
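A small sketch of this adjacency step under the common settings described above ($\theta = 1$, directed edges by the $c(v_i) \ge 2c(v_j) - 1$ rule); the node data are toy values.

```python
# Sketch: build UMI-graph edges. An edge between two ECs is considered only
# if their reference sets overlap and their UMIs are within Hamming
# distance theta; the edge v_i -> v_j is directed when c(v_i) >= 2*c(v_j) - 1.
from dataclasses import dataclass

@dataclass
class Node:
    umi: str
    refs: frozenset  # reference set m(v)
    count: int       # cardinality c(v)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def build_edges(nodes, theta=1):
    edges = []
    for i, vi in enumerate(nodes):
        for j, vj in enumerate(nodes):
            if i < j and vi.refs & vj.refs and hamming(vi.umi, vj.umi) <= theta:
                if vi.count >= 2 * vj.count - 1:
                    edges.append((i, j))  # vi is the likely "parent" UMI
                elif vj.count >= 2 * vi.count - 1:
                    edges.append((j, i))
    return edges

nodes = [
    Node("AAAA", frozenset({"geneA"}), 100),
    Node("AAAT", frozenset({"geneA"}), 3),  # likely PCR/sequencing error
    Node("GGGG", frozenset({"geneA"}), 50),
]
print(build_edges(nodes))  # [(0, 1)]: AAAT collapses into AAAA
```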
Defining the graph resolution approach
Given the defined UMI graph, many different resolution approaches may be applied. A resolution method may be as simple as finding the set of connected components, clustering the graph, greedily collapsing nodes or contracting edges [Smith et al., 2017], or searching for a cover of the graph by structures following certain rules (e.g., monochromatic arborescences [Srivastava et al., 2019]) to reduce the graph. As a result, each node in the reduced UMI graph, or each element of the cover in the case that the graph is not modified dynamically, represents a pre-PCR molecule. The collapsed nodes or covering sets are regarded as the PCR duplicates of that molecule. Different rules for defining the adjacency relationship and different approaches for graph resolution can seek to preserve different properties, and together they define a wide variety of distinct overall UMI resolution approaches. For approaches that probabilistically resolve ambiguity caused by multimapping, the resolved UMI graph may contain multi-gene equivalence classes (ECs), with their gene origins determined in the next step.
3.4.1.1. Quantification
The last step in UMI resolution is quantifying the abundance of each gene using the resolved UMI graph. For approaches that discard multi-gene ECs, the molecule count vector for the genes in the current cell being processed (or count vector for short) is generated by counting the number of ECs labeled with each gene. On the other hand, approaches that process, rather than discard, multi-gene ECs usually resolve the ambiguity by applying a statistical inference procedure. For example, Srivastava et al. [2019] introduce an expectation-maximization (EM) approach for probabilistically assigning multi-gene UMIs, and related EM algorithms have also been introduced as optional steps in subsequent tools [He et al., 2022, Kaminow et al., 2021, Melsted et al., 2021]. In this model, the collapsed-EC-to-gene assignments are latent variables, and the deduplicated molecule counts of genes are the main parameters. Intuitively, evidence from gene-unique ECs is used to help probabilistically apportion the multi-gene ECs. The EM algorithm seeks the parameters that together have the (locally) highest likelihood of generating the observed ECs.
Usually, the UMI resolution and quantification process described above is performed separately for each cell, represented by a corrected CB, to create a complete count matrix for all genes in all cells. However, the relative paucity of per-cell information in high-throughput single-cell samples limits the evidence available when performing UMI resolution, which in turn limits the potential efficacy of model-based solutions like the statistical inference procedure described above.
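A toy sketch of the EM idea, assuming UMI resolution has already produced ECs with their gene sets and deduplicated molecule counts; the gene names and numbers are fabricated.

```python
# Toy EM sketch: gene abundances (parameters) and EC-to-gene assignments
# (latent variables) are alternately updated until convergence.
def em_quantify(ecs, genes, n_iter=100):
    """ecs: list of (set_of_genes, deduplicated_umi_count) pairs."""
    theta = {g: 1.0 / len(genes) for g in genes}  # initial abundances
    counts = {}
    for _ in range(n_iter):
        counts = {g: 0.0 for g in genes}
        for gene_set, n in ecs:                   # E-step: apportion each EC
            z = sum(theta[g] for g in gene_set)
            for g in gene_set:
                counts[g] += n * theta[g] / z
        total = sum(counts.values())              # M-step: renormalize
        theta = {g: c / total for g, c in counts.items()}
    return counts

ecs = [({"g1"}, 90), ({"g2"}, 10), ({"g1", "g2"}, 20)]  # toy ECs
print(em_quantify(ecs, ["g1", "g2"]))  # most of the shared 20 goes to g1
```

Evidence from the gene-unique ECs (90 vs. 10 molecules) drives how the shared EC is split, mirroring the intuition described above.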
3.5. Count matrix quality control
Once a count matrix has been generated, it is important to perform a quality control (QC) assessment, and there are several distinct assessments that generally fall under this rubric. Basic global metrics are often recorded and reported to help assess the overall quality of the sequencing measurement itself. These metrics include quantities such as the total fraction of mapped reads, the distribution of distinct UMIs observed per cell, the distribution of UMI deduplication rates, and the distribution of detected genes per cell. These and similar metrics are often recorded by the quantification tools themselves [He et al., 2022, Kaminow et al., 2021, Melsted et al., 2021, Zheng et al., 2017], since they arise naturally and can be computed during the process of read mapping, cell barcode correction, and UMI resolution. Likewise, several tools exist to help organize and visualize these basic metrics, such as the Loupe browser, alevinQC, or a kb_python report, depending upon the quantification pipeline being used. Beyond these basic global metrics, at this stage of analysis, QC metrics are designed primarily to help determine which cells (CBs) have been sequenced “successfully” and which exhibit artifacts that warrant filtering or correction.
In the following toggle section, we discuss an example alevinQC report taken from the alevinQC manual webpage. Once alevin or alevin-fry quantifies the single-cell data, the quality of the data can be assessed through the R package alevinQC. The alevinQC report can be generated in PDF format or as an R/Shiny application, and it summarizes various components of the single-cell library, such as reads, CBs, and UMIs.
1. Metadata and summary tables
Fig. 3.15 An example of the summary section of an alevinQC report.
The first section of an alevinQC report shows a summary of the input files and the processing result. The top left table displays the metadata provided by alevin (or alevin-fry) for the quantification results, for example, the time of the run, the version of the tool, and the paths to the input FASTQ and index files. The top right summary table provides summary statistics for various components of the single-cell library, for example, the number of sequencing reads, the number of selected cell barcodes at various levels of filtering, and the total number of deduplicated UMIs.
2. Knee plot, initial whitelist determination
Fig. 3.16 The figure shows the plots in the alevinQC report of an example single-cell dataset, in which the cells are filtered using the “knee” finding method. Each dot represents a corrected cell barcode with its corrected profile.
The first (top left) view in Fig. 3.16 shows the distribution of cell barcode frequency in decreasing order. In all plots shown above, each point represents a corrected cell barcode, with its x-coordinate corresponding to its cell barcode frequency rank. In the top left plot, the y-coordinate corresponds to the observed frequency of the corrected barcode. Generally, this plot shows a “knee”-like pattern, which can be used to identify the initial list of high-quality barcodes. The red dots represent the cell barcodes selected as high-quality in the case that “knee”-based filtering was applied. In other words, these cell barcodes contain a sufficient number of reads to be deemed high-quality and likely derived from truly present cells. If an external permit list was passed in the CB correction step, implying that no internal algorithm was used to distinguish high-quality cell barcodes, all dots in the plot will be colored red, as all of these corrected cell barcodes are processed throughout the raw data processing pipeline and reported in the gene count matrix. One should be skeptical of the data quality if the frequency is consistently low across all cell barcodes.
3. Barcode collapsing
After identifying the barcodes that will be processed, either through an internal threshold (e.g., from the “knee”-based method) or through external whitelisting, alevin (or alevin-fry) performs cell barcode sequence correction. The barcode collapsing plot, the upper middle plot in Fig. 3.16, shows the number of reads assigned to a cell barcode after sequence correction versus the number prior to correction. Generally, all points should fall close to the line $x = y$, meaning that the reassignments in CB correction do not drastically change the profile of the cell barcodes.
4. Knee plot, number of genes per cell
The upper right plot in Fig. 3.16 shows the distribution of the number of observed genes across all processed cell barcodes. Generally, a mean of 2,000 genes per cell is considered modest but reasonable for downstream analyses. One should double-check the quality of the data if all cells have a low number of observed genes.
5. Quantification summary
Finally, a series of quantification summary plots, the bottom plots in Fig. 3.16, compare the cell barcode frequency, the total number of UMIs after deduplication, and the total number of non-zero genes using scatter plots. In general, each plot should demonstrate a positive correlation, and, if high-quality filtering (e.g., knee filtering) has been performed, the high-quality cell barcodes should be well separated from the rest. Moreover, all three plots should convey similar trends. If an external permit list was used, all dots in the plots will be colored red, as all of these cell barcodes are processed and reported in the gene count matrix. Still, we should see the correlation between the plots and the separation of the dots representing high-quality cells from the others. If all of these metrics are consistently low across cells, or if these plots convey substantially different trends, one should be concerned about the data quality.
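Two of the basic global metrics mentioned above can be computed directly from a cells-by-genes sparse count matrix, as in this short sketch; the matrix here is random toy data.

```python
# Sketch: per-cell QC metrics from a cells-by-genes sparse count matrix:
# total UMIs per cell and number of detected genes per cell.
import numpy as np
from scipy import sparse

X = sparse.random(1000, 2000, density=0.02, format="csr", random_state=0)
X.data = np.ceil(X.data * 10)  # toy integer counts

umis_per_cell = np.asarray(X.sum(axis=1)).ravel()
genes_per_cell = np.asarray((X > 0).sum(axis=1)).ravel()
print(umis_per_cell[:5], genes_per_cell[:5])
```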
3.5.1. Empty droplet detection
One of the first QC steps is determining which cell barcodes correspond to “high-confidence” sequenced cells. It is common in droplet-based protocols [Macosko et al., 2015] that certain barcodes are associated with ambient RNA instead of the RNA of a captured cell. This happens when a droplet fails to capture a cell. These empty droplets still tend to produce sequenced reads, although the characteristics of these reads look markedly different from those associated with barcodes corresponding to properly captured cells.
Many approaches exist to assess whether a barcode likely corresponds to an empty droplet. One simple method is to examine the cumulative frequency plot of the barcodes, in which barcodes are sorted in descending order of the number of distinct UMIs with which they are associated. This plot often contains a “knee” that can be identified as a likely point of discrimination between properly captured cells and empty droplets [He et al., 2022, Smith et al., 2017]. While this “knee” method is intuitive and can often estimate a reasonable threshold, it has several drawbacks. For example, not all cumulative histograms display an obvious knee, and it is notoriously difficult to design algorithms that can robustly and automatically detect such knees. Finally, the total UMI count associated with a barcode may not, alone, be the best signal with which to determine whether the barcode was associated with an empty or damaged cell.
This led to the development of several tools specifically designed to detect empty or damaged droplets, or cells generally deemed to be of “low quality” [Alvarez et al., 2020, Heiser et al., 2021, Hippen et al., 2021, Lun et al., 2019, Muskovic and Powell, 2021, Young and Behjati, 2020]. These tools incorporate a variety of different measures of cell quality, including the frequencies of distinct UMIs, the number of detected genes, and the fraction of mitochondrial RNA, and they typically work by applying a statistical model to these features to classify high-quality cells from putative empty droplets or damaged cells. This means that cells can typically be scored, and a final filtering can be selected based on an estimated posterior probability that cells are not empty or compromised. While these models generally work well for single-cell RNA-seq data, one may have to apply several additional filters or heuristics to obtain robust filtering in single-nucleus RNA-seq data [He et al., 2022, Kaminow et al., 2021], like those exposed in the emptyDropsCellRanger function of DropletUtils [Lun et al., 2019].
3.5.2. Doublet detection
In addition to determining which cell barcodes correspond to empty droplets or damaged cells, one may also wish to identify cell barcodes that correspond to doublets or multiplets. When a given droplet captures two (doublet) or more (multiplet) cells, this can result in a skewed distribution for these cell barcodes in terms of quantities like the number of reads and UMIs they represent, as well as in the gene expression profiles they display. Many tools have been developed to predict the doublet status of cell barcodes [Bais and Kostka, 2019, Bernstein et al., 2020, DePasquale et al., 2019, McGinnis et al., 2019, Wolock et al., 2019]. Once detected, cells determined to likely be doublets or multiplets can be removed or otherwise adjusted for in the subsequent analysis.
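As one example from this family of tools, the sketch below runs Scrublet [Wolock et al., 2019] on a toy count matrix; the random matrix and the expected doublet rate are placeholders, and parameters should be tuned per dataset.

```python
# Hedged sketch of doublet detection with Scrublet (Wolock et al., 2019).
# The random toy matrix stands in for a real cells-by-genes count matrix.
import numpy as np
import scipy.sparse as sp
import scrublet as scr

rng = np.random.default_rng(0)
counts = sp.csr_matrix(rng.poisson(0.1, size=(500, 2000)))

scrub = scr.Scrublet(counts, expected_doublet_rate=0.06)
doublet_scores, predicted_doublets = scrub.scrub_doublets()
# predicted_doublets is a boolean mask over barcodes; flagged cells can be
# removed or modeled in downstream analysis.
```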
3.5.2. Doublet detection#

In addition to determining which cell barcodes correspond to empty droplets or damaged cells, one may also wish to identify those cell barcodes that correspond to doublets or multiplets. When a given droplet captures two (doublets) or more (multiplets) cells, this can result in a skewed distribution for these cell barcodes in terms of quantities like the number of reads and UMIs they represent, as well as the gene expression profiles they display. Many tools have been developed to predict the doublet status of cell barcodes [Bais and Kostka, 2019, Bernstein et al., 2020, DePasquale et al., 2019, McGinnis et al., 2019, Wolock et al., 2019]. Once detected, cells determined to likely be doublets and multiplets can be removed or otherwise adjusted for in the subsequent analysis.
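As one example of how such a tool is applied in practice, the following sketch runs Scrublet [Wolock et al., 2019] on a raw count matrix. The matrix here is synthetic, and parameter values such as expected_doublet_rate are dataset-dependent choices, not universal defaults:

```python
import numpy as np
import scrublet as scr

# Synthetic stand-in for a cells x genes raw count matrix.
counts_matrix = np.random.default_rng(0).poisson(0.2, size=(1000, 2000))

# expected_doublet_rate depends on the number of loaded cells; 0.06 is a
# commonly quoted 10x-style ballpark, not a universal default.
scrub = scr.Scrublet(counts_matrix, expected_doublet_rate=0.06)
doublet_scores, predicted_doublets = scrub.scrub_doublets(
    min_counts=2, min_cells=3, n_prin_comps=30
)

# Threshold detection can fail on degenerate data, in which case the
# boolean predictions are not returned.
if predicted_doublets is not None:
    print(f"Flagged {predicted_doublets.sum()} putative doublets")
```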
3.6. Count data representation#

As one completes initial raw data processing and quality control and moves on to subsequent analyses, it is important to acknowledge and remember that the cell-by-gene count matrix is, at best, an approximation of the sequenced molecules in the original sample. At several stages of the raw data processing pipeline, heuristics are applied, and simplifications are made to enable the generation of this count matrix. For example, read mapping is imperfect, as is cell barcode correction. Accurately resolving UMIs is particularly challenging, and issues related to UMIs attached to multimapping reads are often overlooked. Additionally, multiple priming sites, especially in unspliced molecules, can violate the commonly assumed one molecule-to-one UMI relationship.

3.7. Brief discussion#

To close this chapter, we convey some observations and suggestions that have arisen from recent benchmarking and review studies surrounding some of the common preprocessing tools described above [Brüning et al., 2022, You et al., 2021]. It is, of course, important to note that the development of methods and tools for single-cell and single-nucleus RNA-seq raw data processing, as well as the continual evaluation of such methods, is an ongoing community effort. It is therefore often useful and reasonable, when performing your own analyses, to experiment with several different tools.

At the coarsest level, the most commonly used tools can process data robustly and accurately. It has been suggested that with many common downstream analyses, like clustering, and the methods used to perform them, the choice of preprocessing tool typically makes less difference than other steps in the analysis process [You et al., 2021]. Nonetheless, it has also been observed that applying lightweight mapping restricted to the spliced transcriptome can increase the probability of spurious mapping and gene expression [Brüning et al., 2022].

Ultimately, the choice of a specific tool largely depends on the task at hand and the constraints on available computational resources. If performing a standard single-cell analysis, lightweight mapping-based methods are a good choice since they are faster (often considerably so) and more memory-frugal than existing alignment-based tools. If performing single-nucleus RNA-seq analysis, alevin-fry is a particularly attractive option, as it remains memory-frugal and its index remains relatively small even as the transcriptome reference is expanded to include unspliced reference sequence. On the other hand, alignment-based methods are recommended when recovering reads that map outside the (extended) transcriptome is important, or when genomic mapping sites are required for downstream analyses. This is particularly relevant for tasks such as differential transcript usage analysis using tools like sierra [Patrick et al., 2020]. Among the alignment-based pipelines, according to Brüning et al. [2022], STARsolo should be favored over Cell Ranger because the former is much faster than the latter and requires less memory, while it is also capable of producing almost identical results.

3.8. A real-world example#

Given that we have covered the concepts underlying various approaches for raw data processing, we now turn our attention to demonstrating how a specific tool (in this case, alevin-fry) can be used to process a small example dataset. To start, we need the sequenced reads from a single-cell experiment in FASTQ format and the reference (e.g., transcriptome) against which the reads will be mapped. Usually, a reference includes the genome sequences and the corresponding gene annotations of the sequenced species, in FASTA and GTF format, respectively.

In this example, we will use chromosome 5 of the human genome and its related gene annotations as the reference, which is a subset of the human GRCh38 (GENCODE v32/Ensembl 98) reference from the 10x Genomics reference build. Correspondingly, we extract the subset of reads that map to this reference from a human brain tumor dataset from 10x Genomics.

Alevin-fry [He et al., 2022] is a fast, accurate, and memory-frugal single-cell and single-nucleus data processing tool. Simpleaf is a program, written in Rust, that exposes a unified and simplified interface for processing some of the most common protocols and data types using the alevin-fry pipeline. A Nextflow-based workflow tool also exists to process extensive collections of single-cell data. Here we will first show how to process single-cell raw data using two simpleaf commands. Then, we describe the complete set of salmon alevin and alevin-fry commands to which these simpleaf commands correspond, to outline where the steps described in this section occur and to convey the possible different processing options. These commands will be run from the command line, and conda will be used for installing all of the software required for running this example.

3.8.1. Preparation#

Before we start, we create a conda environment in the terminal and install the required package. Simpleaf depends on alevin-fry, salmon and pyroe. They are all available on bioconda and will be automatically installed when installing simpleaf.

```bash
conda create -n af -y -c bioconda simpleaf
conda activate af
```

Note on using an Apple silicon-based device: conda does not currently build most packages natively for Apple silicon. Therefore, if you are using a non-Intel-based Apple computer (e.g., with an M1 (Pro/Max/Ultra) or M2 chip), you should make sure that your environment uses the Rosetta 2 translation layer. To do this, you can replace the above commands with the following (instructions adopted from here):

```bash
CONDA_SUBDIR=osx-64 conda create -n af -y -c bioconda simpleaf   # create a new environment
conda activate af
conda env config vars set CONDA_SUBDIR=osx-64                    # subsequent commands use intel packages
```

Next, we create a working directory, af_xmpl_run, and download and uncompress the example dataset from a remote host.

```bash
# Create a working dir and go to the working directory
## The && operator helps execute two commands using a single line of code.
mkdir af_xmpl_run && cd af_xmpl_run

# Fetch the example dataset and CB permit list and decompress them
## The pipe operator (|) passes the output of the wget command to the tar command.
## The dash (-) after tar xzf tells tar to read the archive from standard input.
## - example dataset
wget -qO- https://umd.box.com/shared/static/lx2xownlrhz3us8496tyu9c4dgade814.gz | tar xzf - --strip-components=1 -C .
## The fetched folder containing the fastq files is called toy_read_fastq.
```
```bash
fastq_dir="toy_read_fastq"
## The fetched folder containing the human ref files is called toy_human_ref.
ref_dir="toy_human_ref"

# Fetch CB permit list
## The right chevron (>) redirects STDOUT to a file.
wget -qO- https://raw.githubusercontent.com/10XGenomics/cellranger/master/lib/python/cellranger/barcodes/3M-february-2018.txt.gz | gunzip - > 3M-february-2018.txt
```

With the reference files (the genome FASTA file and the gene annotation GTF file) and read records (the FASTQ files) ready, we can now apply the raw data processing pipeline discussed above to generate the gene count matrix.

3.8.2. Simplified raw data processing pipeline#

Simpleaf is designed to simplify the alevin-fry interface for single-cell and single-nucleus raw data processing. It encapsulates the whole processing pipeline into two steps:

1. simpleaf index indexes the provided reference, or makes a splici reference (spliced transcripts + introns) and indexes it.
2. simpleaf quant maps the sequencing reads against the indexed reference and quantifies the mapping records to generate a gene count matrix.

More advanced usages and options for mapping with simpleaf can be found here. When running simpleaf index, if a genome FASTA file (-f) and a gene annotation GTF file (-g) are provided, it will generate a splici reference and index it; if only a transcriptome FASTA file is provided (--refseq), it will index that directly. Currently, we recommend the splici index.

```bash
# simpleaf needs the environment variable ALEVIN_FRY_HOME to store configurations and data.
# For example, the paths to the underlying programs it uses and the CB permit list.
mkdir alevin_fry_home && export ALEVIN_FRY_HOME='alevin_fry_home'

# The simpleaf set-paths command finds the path to the required tools and writes a configuration JSON file in the ALEVIN_FRY_HOME folder.
simpleaf set-paths

# simpleaf index
# Usage: simpleaf index -o out_dir [-f genome_fasta -g gene_annotation_GTF|--refseq transcriptome_fasta] -r read_length -t number_of_threads
## The -r read_length is the number of sequencing cycles performed by the sequencer to generate biological reads (read2 in Illumina).
## Publicly available datasets usually have the read length in the description; sometimes it is called the number of cycles.
simpleaf index \
-o simpleaf_index \
-f toy_human_ref/fasta/genome.fa \
-g toy_human_ref/genes/genes.gtf \
-r 90 \
-t 8
```

In the output directory simpleaf_index, the ref folder contains the splici reference, and the index folder contains the salmon index built upon the splici reference.

The next step, simpleaf quant, consumes an index directory and the mapping record FASTQ files to generate a gene count matrix. This command encapsulates all the major steps discussed in this section, including mapping, cell barcode correction, and UMI resolution.
```bash
# Collecting sequencing read files
## The reads1 and reads2 variables are defined by finding the filenames matching the patterns "R1" and "R2" in the toy_read_fastq directory.
reads1_pat="R1"
reads2_pat="R2"
## The read files must be sorted and separated by a comma.
### The find command finds the files in the fastq_dir matching the name pattern.
### The sort command sorts the file names.
### The awk command and the paste command together convert the file names into a comma-separated string.
reads1="$(find -L ${fastq_dir} -name "*$reads1_pat*" -type f | sort | awk -v OFS=, '{$1=$1;print}' | paste -sd,)"
reads2="$(find -L ${fastq_dir} -name "*$reads2_pat*" -type f | sort | awk -v OFS=, '{$1=$1;print}' | paste -sd,)"

# simpleaf quant
## Usage: simpleaf quant -c chemistry -t threads -1 reads1 -2 reads2 -i index -u [unspliced permit list] -r resolution -m t2g_3col -o output_dir
simpleaf quant \
-c 10xv3 -t 8 \
-1 $reads1 -2 $reads2 \
-i simpleaf_index/index \
-u -r cr-like \
-m simpleaf_index/index/t2g_3col.tsv \
-o simpleaf_quant
```

After running these commands, the resulting quantification information can be found in the simpleaf_quant/af_quant/alevin folder. Within this directory, there are three files: quants_mat.mtx, quants_mat_cols.txt, and quants_mat_rows.txt, which correspond, respectively, to the count matrix, the gene names for each column of this matrix, and the corrected, filtered cell barcodes for each row of this matrix. The tail lines of these files are shown below. Of note here is the fact that alevin-fry was run in USA mode (unspliced, spliced, and ambiguous mode), so quantification was performed on both the spliced and unspliced status of each gene: the resulting quants_mat_cols.txt file has a number of rows equal to 3 times the number of annotated genes, corresponding to the names used for the spliced (S), unspliced (U), and splicing-ambiguous (A) variants of each gene.

```bash
# Each line in quants_mat.mtx represents
# a non-zero entry in the format row column entry
$ tail -3 simpleaf_quant/af_quant/alevin/quants_mat.mtx
138 58 1
139 9 1
139 37 1

# Each line in quants_mat_cols.txt is a splice status
# of a gene in the format (gene name)-(splice status)
$ tail -3 simpleaf_quant/af_quant/alevin/quants_mat_cols.txt
ENSG00000120705-A
ENSG00000198961-A
ENSG00000245526-A

# Each line in quants_mat_rows.txt is a corrected
# (and, potentially, filtered) cell barcode
$ tail -3 simpleaf_quant/af_quant/alevin/quants_mat_rows.txt
TTCGATTTCTGAATCG
TGCTCGTGTTCGAAGG
ACTGTGAAGAAATTGC
```

We can load the count matrix into Python as an AnnData object using the load_fry function from pyroe. A similar function, loadFry, has been implemented in the fishpond R package.

```python
import pyroe

quant_dir = 'simpleaf_quant/af_quant'
adata_sa = pyroe.load_fry(quant_dir)
```

The default behavior loads the X layer of the AnnData object as the sum of the spliced and ambiguous counts for each gene. However, recent work [Pool et al., 2022] and updated practices suggest that the inclusion of intronic counts, even in single-cell RNA-seq data, may increase sensitivity and benefit downstream analyses. While the best way to make use of this information is the subject of ongoing research, since alevin-fry automatically quantifies spliced, unspliced, and ambiguous reads in each sample, the count matrix containing the total counts for each gene can simply be obtained as follows:

```python
import pyroe

quant_dir = 'simpleaf_quant/af_quant'
adata_usa = pyroe.load_fry(quant_dir, output_format={'X': ['U', 'S', 'A']})
```
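The same output_format dictionary convention can also be used to lay the USA components out into named layers. A sketch is shown below; the layer names spliced and unspliced are our own choice for illustration (e.g., for RNA velocity-style downstream tools), and the ability to map arbitrary layer names is an assumption based on pyroe's custom-format convention shown above:

```python
import pyroe

# Hypothetical layer layout: spliced + ambiguous counts in X and in a
# "spliced" layer, unspliced counts in an "unspliced" layer. Only "X"
# has special meaning; the other keys become AnnData layers.
custom_format = {
    "X": ["S", "A"],
    "spliced": ["S", "A"],
    "unspliced": ["U"],
}
adata_velo = pyroe.load_fry("simpleaf_quant/af_quant", output_format=custom_format)
```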
3.8.3. The complete alevin-fry pipeline#

Simpleaf makes it possible to process single-cell raw data in the "standard" way with a few commands. Next, we show how to generate an identical quantification result by explicitly calling the pyroe, salmon, and alevin-fry commands. Beyond its pedagogical value, knowing the exact command for each step is helpful if only a part of the pipeline needs to be rerun, or if some parameters not currently exposed by simpleaf need to be specified.

Please note that the commands in the Preparation section should be executed in advance. All the tools called in the following commands, pyroe, salmon, and alevin-fry, have already been installed when installing simpleaf.

3.8.3.1. Building the index#

First, we process the genome FASTA file and gene annotation GTF file to obtain the splici index. The commands in the following code chunk are analogous to the simpleaf index command discussed above. This includes two steps:

1. Building the splici reference (spliced transcripts + introns) by calling pyroe make-splici, using the genome and gene annotation file.
2. Indexing the splici reference by calling salmon index.

```bash
# make splici reference
## Usage: pyroe make-splici genome_file gtf_file read_length out_dir
## The read_length is the number of sequencing cycles performed by the sequencer. Ask your technician if you are not sure about it.
## Publicly available datasets usually have the read length in the description.
pyroe make-splici \
${ref_dir}/fasta/genome.fa \
${ref_dir}/genes/genes.gtf \
90 \
splici_rl90_ref

# Index the reference
## Usage: salmon index -t extend_txome.fa -i idx_out_dir -p num_threads
## The $() expression runs the command inside and puts the output in place.
## Please ensure that only one file ends with ".fa" in the splici_ref folder.
salmon index \
-t $(ls splici_rl90_ref/*.fa) \
-i salmon_index \
-p 8
```

The splici index can be found in the salmon_index directory.

3.8.3.2. Mapping and quantification#

Next, we will map the sequencing reads against the splici index by calling salmon alevin. This will produce an output folder called salmon_alevin that contains all the information we need to process the mapped reads using alevin-fry.

```bash
# Collect FASTQ files
## The filenames are sorted and separated by space.
reads1="$(find -L $fastq_dir -name "*$reads1_pat*" -type f | sort | awk '{$1=$1;print}' | paste -sd' ')"
reads2="$(find -L $fastq_dir -name "*$reads2_pat*" -type f | sort | awk '{$1=$1;print}' | paste -sd' ')"

# Mapping
## Usage: salmon alevin -i index_dir -l library_type -1 reads1_files -2 reads2_files -p num_threads -o output_dir
## The variables reads1 and reads2 defined above are passed in using ${}.
salmon alevin \
-i salmon_index \
-l ISR \
-1 ${reads1} \
-2 ${reads2} \
-p 8 \
-o salmon_alevin \
--chromiumV3 \
--sketch
```

Then, we execute the cell barcode correction and UMI resolution step using alevin-fry. This procedure involves three alevin-fry commands:

1. The generate-permit-list command is used for cell barcode correction.
2. The collate command filters out invalid mapping records, corrects cell barcodes, and collates mapping records originating from the same corrected cell barcode.
3. The quant command performs UMI resolution and quantification.

```bash
# Cell barcode correction
## Usage: alevin-fry generate-permit-list -u CB_permit_list -d expected_orientation -o gpl_out_dir
## Here, the reads that map to the reverse complement strand of transcripts are filtered out by specifying -d fw.
alevin-fry generate-permit-list \
-u 3M-february-2018.txt \
-d fw \
-i salmon_alevin \
-o alevin_fry_gpl

# Filter mapping information
## Usage: alevin-fry collate -i gpl_out_dir -r alevin_map_dir -t num_threads
alevin-fry collate \
-i alevin_fry_gpl \
-r salmon_alevin \
-t 8
```
```bash
# UMI resolution + quantification
## Usage: alevin-fry quant -r resolution -m txp_to_gene_mapping -i gpl_out_dir -o quant_out_dir -t num_threads
## The file ending with 3col.tsv in the splici_ref folder will be passed to the -m argument.
## Please ensure that there is only one such file in the splici_ref folder.
alevin-fry quant -r cr-like \
-m $(ls splici_rl90_ref/*3col.tsv) \
-i alevin_fry_gpl \
-o alevin_fry_quant \
-t 8
```

After running these commands, the resulting quantification information can be found in alevin_fry_quant/alevin. Other relevant information concerning the mapping, CB correction, and UMI resolution steps can be found in the salmon_alevin, alevin_fry_gpl, and alevin_fry_quant folders, respectively.

In the example given here, we demonstrate using simpleaf and alevin-fry to process a 10x Chromium 3' v3 dataset. Alevin-fry and simpleaf provide many other options for processing different single-cell protocols, including but not limited to Drop-seq [Macosko et al., 2015], sci-RNA-seq3 [Cao et al., 2019] and other 10x Chromium platforms. A more comprehensive list and description of available options for different stages of processing can be found in the alevin-fry and simpleaf documentation. alevin-fry also provides a Nextflow-based workflow, called quantaf, for conveniently processing many samples from a simply-defined sample sheet. Of course, similar resources exist for many of the other raw data processing tools referenced and described throughout this section, including zUMIs [Parekh et al., 2018], alevin [Srivastava et al., 2019], kallisto|bustools [Melsted et al., 2021], STARsolo [Kaminow et al., 2021] and Cell Ranger. The scrnaseq pipeline from nf-core also provides a Nextflow-based pipeline for processing single-cell RNA-seq data generated using a range of different chemistries, and integrates several of the tools described in this section.

3.9. Useful links#

Alevin-fry tutorials provide tutorials for processing different types of data.

Pyroe in Python and roe in R provide helper functions for processing alevin-fry quantification information. They also provide an interface to the preprocessed datasets in quantaf.

Quantaf is a Nextflow-based workflow of the alevin-fry pipeline for conveniently processing a large number of single-cell and single-nucleus datasets based on input sheets. The preprocessed quantification information of publicly available single-cell datasets is available on its webpage.

Simpleaf is a wrapper of the alevin-fry workflow that allows executing the whole pipeline, from making the splici reference to quantification as shown in the above example, using only two commands.

Tutorials for processing scRNA-seq raw data from the Galaxy Project can be found here and here.

Tutorials for explaining and evaluating FastQC reports are available from MSU, the HBC training program, Galaxy Training and the QC Fail website.

3.10. References#

[rawARJ+20] Marcus Alvarez, Elior Rahmani, Brandon Jew, Kristina M. Garske, Zong Miao, Jihane N. Benhammou, Chun Jimmie Ye, Joseph R. Pisegna, Kirsi H. Pietiläinen, Eran Halperin, and Päivi Pajukanta. Enhancing droplet-based single-nucleus RNA-seq resolution using the semi-supervised machine learning classifier DIEM. Scientific Reports, July 2020. URL: https://doi.org/10.1038/s41598-020-67513-5, doi:10.1038/s41598-020-67513-5.

[rawBK19] (1,2) Abha S Bais and Dennis Kostka. scds: computational annotation of doublets in single-cell RNA sequencing data. Bioinformatics, 36(4):1150–1158, September 2019. URL: https://doi.org/10.1093/bioinformatics/btz698, doi:10.1093/bioinformatics/btz698.

[rawBFL+20] Nicholas J. Bernstein, Nicole L. Fong, Irene Lam, Margaret A. Roy, David G. Hendrickson, and David R. Kelley.
Solo: doublet identification in single-cell RNA-seq via semi-supervised deep learning. Cell Systems, 11(1):95–101.e5, July 2020. URL: https://doi.org/10.1016/j.cels.2020.05.010, doi:10.1016/j.cels.2020.05.010.

[rawBWC+15] Sayantan Bose, Zhenmao Wan, Ambrose Carr, Abbas H. Rizvi, Gregory Vieira, Dana Pe'er, and Peter A. Sims. Scalable microfluidics for single-cell RNA printing and sequencing. Genome Biology, June 2015. URL: https://doi.org/10.1186/s13059-015-0684-3, doi:10.1186/s13059-015-0684-3.

[rawBPMP16] Nicolas L Bray, Harold Pimentel, Páll Melsted, and Lior Pachter. Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology, 34(5):525–527, 2016.

[rawBTS+22] (1,2,3) Ralf Schulze Brüning, Lukas Tombor, Marcel H Schulz, Stefanie Dimmeler, and David John. Comparative analysis of common alignment tools for single-cell RNA sequencing. GigaScience, 2022. URL: https://doi.org/10.1093/gigascience/giac001, doi:10.1093/gigascience/giac001.

[rawCSQ+19] Junyue Cao, Malte Spielmann, Xiaojie Qiu, Xingfan Huang, Daniel M. Ibrahim, Andrew J. Hill, Fan Zhang, Stefan Mundlos, Lena Christiansen, Frank J. Steemers, Cole Trapnell, and Jay Shendure. The single-cell transcriptional landscape of mammalian organogenesis. Nature, 566(7745):496–502, February 2019. URL: https://doi.org/10.1038/s41586-019-0969-x, doi:10.1038/s41586-019-0969-x.

[rawCPM92] Kun-Mao Chao, William R. Pearson, and Webb Miller. Aligning two sequences within a specified diagonal band. Bioinformatics, 8(5):481–487, 1992. URL: https://doi.org/10.1093/bioinformatics/8.5.481, doi:10.1093/bioinformatics/8.5.481.

[rawDSC+19] (1,2) Erica A.K. DePasquale, Daniel J. Schnell, Pieter-Jan Van Camp, Íñigo Valiente-Alandí, Burns C. Blaxall, H. Leighton Grimes, Harinder Singh, and Nathan Salomonis. DoubletDecon: deconvoluting doublets from single-cell RNA-sequencing data. Cell Reports, 29(6):1718–1727.e8, November 2019. URL: https://doi.org/10.1016/j.celrep.2019.09.082, doi:10.1016/j.celrep.2019.09.082.

[rawDDS+13] Alexander Dobin, Carrie A Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R Gingeras. STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1):15–21, 2013.

[rawFDF+20] (1,2) Rick Farouni, Haig Djambazian, Lorenzo E Ferri, Jiannis Ragoussis, and Hamed S Najafabadi. Model-based analysis of sample index hopping reveals its widespread artifacts in multiplexed single-cell RNA-sequencing. Nature Communications, 11(1):1–8, 2020.

[rawFar07] Michael Farrar. Striped Smith–Waterman speeds database searches six times over other SIMD implementations. Bioinformatics, 23(2):156–161, 2007.

[rawGP21] Gennady Gorin and Lior Pachter. Length biases in single-cell RNA sequencing of pre-mRNA. bioRxiv, 2021. URL: https://www.biorxiv.org/content/early/2021/07/31/2021.07.30.454514, doi:10.1101/2021.07.30.454514.

[rawHZS+22] (1,2,3,4,5,6,7,8,9,10,11,12,13,14) Dongze He, Mohsen Zakeri, Hirak Sarkar, Charlotte Soneson, Avi Srivastava, and Rob Patro. Alevin-fry unlocks rapid, accurate and memory-frugal quantification of single-cell RNA-seq data. Nature Methods, 19(3):316–322, 2022.

[rawHWC+21] Cody N. Heiser, Victoria M. Wang, Bob Chen, Jacob J. Hughey, and Ken S. Lau.
Automated quality control and cell identification of droplet-based single-cell data using dropkick. Genome Research, 31(10):1742–1752, April 2021. URL: https://doi.org/10.1101/gr.271908.120, doi:10.1101/gr.271908.120.

[rawHFW+21] Ariel A Hippen, Matias M Falco, Lukas M Weber, Erdogan Pekcan Erkan, Kaiyang Zhang, Jennifer Anne Doherty, Anna Vähärautio, Casey S Greene, and Stephanie C Hicks. miQC: an adaptive probabilistic framework for quality control of single-cell RNA-sequencing data. PLoS Computational Biology, 17(8):e1009290, 2021.

[rawIZJ+13] Saiful Islam, Amit Zeisel, Simon Joost, Gioele La Manno, Pawel Zajac, Maria Kasper, Peter Lönnerberg, and Sten Linnarsson. Quantitative single-cell RNA-seq with unique molecular identifiers. Nature Methods, 11(2):163–166, December 2013. URL: https://doi.org/10.1038/nmeth.2772, doi:10.1038/nmeth.2772.

[rawKYD21] (1,2,3,4,5,6,7,8,9,10,11,12) Benjamin Kaminow, Dinar Yunusov, and Alexander Dobin. STARsolo: accurate, fast and versatile mapping/quantification of single-cell and single-nucleus RNA-seq data. bioRxiv, 2021.

[rawLi18] Heng Li. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18):3094–3100, May 2018. URL: https://doi.org/10.1093/bioinformatics/bty191, doi:10.1093/bioinformatics/bty191.

[rawLRA+19] (1,2,3,4) Aaron TL Lun, Samantha Riesenfeld, Tallulah Andrews, Tomas Gomes, John C Marioni, and others. EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. Genome Biology, 20(1):1–9, 2019.

[rawMBS+15] (1,2,3) Evan Z. Macosko, Anindita Basu, Rahul Satija, James Nemesh, Karthik Shekhar, Melissa Goldman, Itay Tirosh, Allison R. Bialas, Nolan Kamitaki, Emily M. Martersteck, John J. Trombetta, David A. Weitz, Joshua R. Sanes, Alex K. Shalek, Aviv Regev, and Steven A. McCarroll. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell, 161(5):1202–1214, May 2015. URL: https://doi.org/10.1016/j.cell.2015.05.002, doi:10.1016/j.cell.2015.05.002.

[rawMSMME20] Santiago Marco-Sola, Juan Carlos Moure, Miquel Moreto, and Antonio Espinosa. Fast gap-affine pairwise alignment using the wavefront algorithm. Bioinformatics, September 2020. URL: https://doi.org/10.1093/bioinformatics/btaa777, doi:10.1093/bioinformatics/btaa777.

[rawMMG19] (1,2) Christopher S. McGinnis, Lyndsay M. Murrow, and Zev J. Gartner. DoubletFinder: doublet detection in single-cell RNA sequencing data using artificial nearest neighbors. Cell Systems, 8(4):329–337.e4, April 2019. URL: https://doi.org/10.1016/j.cels.2019.03.003, doi:10.1016/j.cels.2019.03.003.

[rawMBL+21] (1,2,3,4,5,6,7) Páll Melsted, A. Sina Booeshaghi, Lauren Liu, Fan Gao, Lambda Lu, Kyung Hoi Min, Eduardo da Veiga Beltrame, Kristján Eldjárn Hjörleifsson, Jase Gehring, and Lior Pachter. Modular, efficient and constant-memory single-cell RNA-seq preprocessing. Nature Biotechnology, 39(7):813–818, April 2021. URL: https://doi.org/10.1038/s41587-021-00870-2, doi:10.1038/s41587-021-00870-2.

[rawMP21] (1,2) Walter Muskovic and Joseph E Powell. DropletQC: improved identification of empty droplets and damaged cells in single-cell RNA-seq data. Genome Biology, 22(1):1–9, 2021.

[rawNMullerHS20] Stefan Niebler, André Müller, Thomas Hankeln, and Bertil Schmidt. RainDrop: rapid activation matrix computation for droplet-based single-cell RNA-seq reads. BMC Bioinformatics, 21(1):1–14, 2020.
[rawOEM+18] Baraa Orabi, Emre Erhan, Brian McConeghy, Stanislav V Volik, Stephane Le Bihan, Robert Bell, Colin C Collins, Cedric Chauve, and Faraz Hach. Alignment-free clustering of UMI tagged DNA molecules. Bioinformatics, 35(11):1829–1836, October 2018. URL: https://doi.org/10.1093/bioinformatics/bty888, doi:10.1093/bioinformatics/bty888.

[rawPZV+18] (1,2,3,4) Swati Parekh, Christoph Ziegenhain, Beate Vieth, Wolfgang Enard, and Ines Hellmann. zUMIs - a fast and flexible pipeline to process RNA sequencing data with UMIs. GigaScience, May 2018. URL: https://doi.org/10.1093/gigascience/giy059, doi:10.1093/gigascience/giy059.

[rawPHJ+20] Ralph Patrick, David T. Humphreys, Vaibhao Janbandhu, Alicia Oshlack, Joshua W.K. Ho, Richard P. Harvey, and Kitty K. Lo. Sierra: discovery of differential transcript usage from polyA-captured single-cell RNA-seq data. Genome Biology, July 2020. URL: https://doi.org/10.1186/s13059-020-02071-7, doi:10.1186/s13059-020-02071-7.

[rawPPC+22] (1,2,3) Allan-Hermann Pool, Helen Poldsam, Sisi Chen, Matt Thomson, and Yuki Oka. Enhanced recovery of single-cell RNA-sequencing reads for missing gene expression data. bioRxiv, 2022. URL: https://www.biorxiv.org/content/early/2022/04/27/2022.04.26.489449, doi:10.1101/2022.04.26.489449.

[rawRS00] Torbjørn Rognes and Erling Seeberg. Six-fold speed-up of Smith–Waterman sequence database searches using parallel processing on common microprocessors. Bioinformatics, 16(8):699–706, 2000.

[rawSHS17] (1,2,3,4,5,6,7,8,9,10,11) Tom Smith, Andreas Heger, and Ian Sudbery. UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy. Genome Research, 27(3):491–499, 2017.

[rawSSPS21] Charlotte Soneson, Avi Srivastava, Rob Patro, and Michael B Stadler. Preprocessing choices affect RNA velocity results for droplet scRNA-seq data. PLoS Computational Biology, 17(1):e1008585, 2021.

[rawSMS+20] Avi Srivastava, Laraib Malik, Hirak Sarkar, Mohsen Zakeri, Fatemeh Almodaresi, Charlotte Soneson, Michael I Love, Carl Kingsford, and Rob Patro. Alignment and mapping methodology influence transcript abundance estimation. Genome Biology, 21(1):1–29, 2020.

[rawSMS+19] (1,2,3,4,5,6,7,8,9,10,11,12,13,14) Avi Srivastava, Laraib Malik, Tom Smith, Ian Sudbery, and Rob Patro. Alevin efficiently estimates accurate gene abundances from dscRNA-seq data. Genome Biology, 20(1):1–16, 2019.

[rawSSGP16] Avi Srivastava, Hirak Sarkar, Nitish Gupta, and Rob Patro. RapMap: a rapid, sensitive and accurate tool for mapping RNA-seq reads to transcriptomes. Bioinformatics, 32(12):i192–i200, 2016.

[rawSK18] Hajime Suzuki and Masahiro Kasahara. Introducing difference recurrence relations for faster semi-global alignment of long sequences. BMC Bioinformatics, February 2018. URL: https://doi.org/10.1186/s12859-018-2014-8, doi:10.1186/s12859-018-2014-8.

[rawTMP+21] (1,2) Maria Tsagiopoulou, Maria Christina Maniou, Nikolaos Pechlivanis, Anastasis Togkousidis, Michaela Kotrová, Tobias Hutzenlaub, Ilias Kappas, Anastasia Chatzidimitriou, and Fotis Psomopoulos. UMIc: a preprocessing method for UMI deduplication and reads correction. Frontiers in Genetics, May 2021. URL: https://doi.org/10.3389/fgene.2021.660366, doi:10.3389/fgene.2021.660366.

[rawWLK19] (1,2) Samuel L. Wolock, Romain Lopez, and Allon M. Klein. Scrublet: computational identification of cell doublets in single-cell transcriptomic data.
Cell Systems, 8(4):281–291.e9, April 2019. URL: https://doi.org/10.1016/j.cels.2018.11.005, doi:10.1016/j.cels.2018.11.005.

[rawWoz97] Andrzej Wozniak. Using video-oriented instructions to speed up sequence comparison. Bioinformatics, 13(2):145–150, 1997.

[rawYTS+21] (1,2) Yue You, Luyi Tian, Shian Su, Xueyi Dong, Jafar S. Jabbari, Peter F. Hickey, and Matthew E. Ritchie. Benchmarking UMI-based single-cell RNA-seq preprocessing workflows. Genome Biology, December 2021. URL: https://doi.org/10.1186/s13059-021-02552-3, doi:10.1186/s13059-021-02552-3.

[rawYB20] (1,2) Matthew D Young and Sam Behjati. SoupX removes ambient RNA contamination from droplet-based single-cell RNA sequencing data. GigaScience, December 2020. URL: https://doi.org/10.1093/gigascience/giaa151, doi:10.1093/gigascience/giaa151.

[rawZT21] Luke Zappia and Fabian J. Theis. Over 1000 tools reveal trends in the single-cell RNA-seq analysis landscape. Genome Biology, 22(1):301, October 2021. URL: https://doi.org/10.1186/s13059-021-02519-4, doi:10.1186/s13059-021-02519-4.

[rawZSWM00] Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb Miller. A greedy algorithm for aligning DNA sequences. Journal of Computational Biology, 7(1-2):203–214, February 2000. URL: https://doi.org/10.1089/10665270050081478, doi:10.1089/10665270050081478.

[rawZTB+17] (1,2,3,4) Grace X. Y. Zheng, Jessica M. Terry, Phillip Belgrader, Paul Ryvkin, Zachary W. Bent, Ryan Wilson, Solongo B. Ziraldo, Tobias D. Wheeler, Geoff P. McDermott, Junjie Zhu, Mark T. Gregory, Joe Shuga, Luz Montesclaros, Jason G. Underwood, Donald A. Masquelier, Stefanie Y. Nishimura, Michael Schnall-Levin, Paul W. Wyatt, Christopher M. Hindson, Rajiv Bharadwaj, Alexander Wong, Kevin D. Ness, Lan W. Beppu, H. Joachim Deeg, Christopher McFarland, Keith R. Loeb, William J. Valente, Nolan G. Ericson, Emily A. Stevens, Jerald P. Radich, Tarjei S. Mikkelsen, Benjamin J. Hindson, and Jason H. Bielas. Massively parallel digital transcriptional profiling of single cells. Nature Communications, 8(1):14049, January 2017. URL: https://doi.org/10.1038/ncomms14049, doi:10.1038/ncomms14049.

[rawZHHJS22] Christoph Ziegenhain, Gert-Jan Hendriks, Michael Hagemann-Jensen, and Rickard Sandberg. Molecular spikes: a gold standard for single-cell RNA counting. Nature Methods, 19(5):560–566, 2022.

[raw10xGenomics21] (1,2,3,4,5) 10x Genomics. Technical note - interpreting intronic and antisense reads in 10x Genomics single cell gene expression data. https://www.10xgenomics.com/support/single-cell-gene-expression/documentation/steps/sequencing/interpreting-intronic-and-antisense-reads-in-10-x-genomics-single-cell-gene-expression-data, August 2021.

3.11. Contributors#

We gratefully acknowledge the contributions of:

3.11.1. Authors#

Dongze He, Avi Srivastava, Hirak Sarkar, Rob Patro, Seo H. Kim

3.11.2. Reviewers#

Lukas Heumos
Pattern 2: Compositional analysis

17. Compositional analysis#

17.1. Motivation#

Beyond changes in gene expression patterns, cell compositions, such as the proportions of cell types, can change between conditions. A specific drug may, for example, induce transdifferentiation of a cell type, which will be reflected in the cell identity composition. Sufficient cell and sample numbers are required to accurately determine cell-identity cluster proportions and background variation. Compositional analysis can be done on the level of cell identity clusters in the form of known cell types, or of cell states corresponding to, for example, cells recently affected by perturbations.

Fig. 17.1 Differential abundance analysis compares the composition of cell types between two conditions. The samples from both conditions contain different proportions of cell types, which can be tested for significant shifts in abundance.

This chapter will introduce both approaches and apply them to the Haber dataset [Haber et al., 2017]. This dataset contains 53,193 individual epithelial cells from the small intestine and organoids of mice. Some of the cells were subjected to bacterial or helminth infection, with Salmonella and Heligmosomoides polygyrus respectively. Throughout this tutorial we use a subset of the complete Haber dataset which only includes the control and infected cells that were collected specifically for this purpose. Notably, for faster computation and reduced complexity, we exclude an additional dataset in which only large cells were collected.
As a first step, we load the dataset.

17.2. Data loading#

```python
import warnings

import pandas as pd

warnings.filterwarnings("ignore")
warnings.simplefilter("ignore")

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pertpy as pt
import scanpy as sc
import seaborn as sns

adata = pt.dt.haber_2017_regions()
adata
```

```
AnnData object with n_obs × n_vars = 9842 × 15215
    obs: 'batch', 'barcode', 'condition', 'cell_label'
```

```python
adata.obs
```

```
                                                    batch         barcode   condition             cell_label
index
B1_AAACATACCACAAC_Control_Enterocyte.Progenitor        B1  AAACATACCACAAC     Control  Enterocyte.Progenitor
B1_AAACGCACGAGGAC_Control_Stem                         B1  AAACGCACGAGGAC     Control                   Stem
B1_AAACGCACTAGCCA_Control_Stem                         B1  AAACGCACTAGCCA     Control                   Stem
B1_AAACGCACTGTCCC_Control_Stem                         B1  AAACGCACTGTCCC     Control                   Stem
B1_AAACTTGACCACCT_Control_Enterocyte.Progenitor        B1  AAACTTGACCACCT     Control  Enterocyte.Progenitor
...                                                   ...             ...         ...                    ...
B10_TTTCACGACAAGCT_Salmonella_TA                      B10  TTTCACGACAAGCT  Salmonella                     TA
B10_TTTCAGTGAGGCGA_Salmonella_Enterocyte              B10  TTTCAGTGAGGCGA  Salmonella             Enterocyte
B10_TTTCAGTGCGACAT_Salmonella_Stem                    B10  TTTCAGTGCGACAT  Salmonella                   Stem
B10_TTTCAGTGTGACCA_Salmonella_Endocrine               B10  TTTCAGTGTGACCA  Salmonella              Endocrine
B10_TTTCAGTGTTCTCA_Salmonella_Enterocyte.Progenitor   B10  TTTCAGTGTTCTCA  Salmonella  Enterocyte.Progenitor

9842 rows × 4 columns
```

The data was collected in 10 batches. The unique conditions are Control, Salmonella, Hpoly.Day3 and Hpoly.Day10, which correspond to the healthy control state, Salmonella infection, and Heligmosomoides polygyrus infection after 3 and 10 days, respectively. The cell_label column corresponds to the cell types.

17.3. Why cell-type count data is compositional#

When analyzing compositional shifts in cell count data, multiple technical and methodological limitations need to be accounted for. One challenge is the characteristically low number of experimental replicates, which leads to large confidence intervals when conducting differential abundance analysis with frequentist statistical tests. More importantly, single-cell sequencing is naturally limited in the number of cells per sample: we cannot sequence every cell in a tissue or organ, but use a small, representative snapshot instead. This, however, forces us to view the cell type counts as purely proportional, i.e. the total number of cells in a sample is only a scaling factor. In the statistical literature, such data is known as compositional data [Aitchison, 1982], and is characterized by the relative abundances of all features (cell types in our case) in one sample always adding up to one. Because of this sum-to-one constraint, a negative correlation between the cell type abundances is induced.

To illustrate this, let's consider the following example. In a case-control study, we want to compare the cell type composition of a healthy and a diseased organ. In both cases, we have three cell types (A, B and C), but their abundances differ: the healthy organ consists of 2,000 cells of each type (6,000 cells total). The disease leads to a doubling of cell type A, while cell types B and C are not affected, so that the diseased organ has 8,000 cells.
```python
healthy_tissue = [2000, 2000, 2000]
diseased_tissue = [4000, 2000, 2000]

example_data_global = pd.DataFrame(
    data=np.array([healthy_tissue, diseased_tissue]),
    index=[1, 2],
    columns=["A", "B", "C"],
)
example_data_global["Disease status"] = ["Healthy", "Diseased"]
example_data_global
```

```
      A     B     C Disease status
1  2000  2000  2000        Healthy
2  4000  2000  2000       Diseased
```

```python
plot_data_global = example_data_global.melt(
    "Disease status", ["A", "B", "C"], "Cell type", "count"
)

fig, ax = plt.subplots(1, 2, figsize=(12, 6))
sns.barplot(
    data=plot_data_global, x="Disease status", y="count", hue="Cell type", ax=ax[0]
)
ax[0].set_title("Global abundances, by status")
sns.barplot(
    data=plot_data_global, x="Cell type", y="count", hue="Disease status", ax=ax[1]
)
ax[1].set_title("Global abundances, by cell type")
plt.show()
```

We want to find out which cell types increase or decrease in abundance in the diseased organ. If we were able to determine the type of every cell in both organs, the case would be clear, as we can see in the right plot above. Unfortunately, this is not possible. Since our sequencing process has a limited capacity, we can only take a representative sample of 600 cells from both populations. To simulate this step, we can use numpy's multinomial sampling to draw 600 cells from each population:

```python
rng = np.random.default_rng(1234)

healthy_sample = rng.multinomial(pvals=healthy_tissue / np.sum(healthy_tissue), n=600)
diseased_sample = rng.multinomial(
    pvals=diseased_tissue / np.sum(diseased_tissue), n=600
)

example_data_sample = pd.DataFrame(
    data=np.array([healthy_sample, diseased_sample]),
    index=[1, 2],
    columns=["A", "B", "C"],
)
example_data_sample["Disease status"] = ["Healthy", "Diseased"]
example_data_sample
```

```
     A    B    C Disease status
1  193  201  206        Healthy
2  296  146  158       Diseased
```

```python
plot_data_sample = example_data_sample.melt(
    "Disease status", ["A", "B", "C"], "Cell type", "count"
)

fig, ax = plt.subplots(1, 2, figsize=(12, 6))
sns.barplot(
    data=plot_data_sample, x="Disease status", y="count", hue="Cell type", ax=ax[0]
)
ax[0].set_title("Sampled abundances, by status")
sns.barplot(
    data=plot_data_sample, x="Cell type", y="count", hue="Disease status", ax=ax[1]
)
ax[1].set_title("Sampled abundances, by cell type")
plt.show()
```

Now the picture is not clear anymore. While the counts of cell type A still increase (from approx. 200 to 300), the other two cell types seem to decrease from about 200 to 150. This apparent decrease is caused by our constraint to 600 cells: if a larger fraction of the sample is taken up by cell type A, the share of cell types B and C must be lower. Therefore, determining the change in abundance of one cell type is impossible without taking the other cell types into account.

If we ignore the compositionality of the data and use univariate methods like Wilcoxon rank-sum tests or scDC, a method which performs differential cell-type composition analysis by bootstrap resampling [Cao et al., 2019], we may falsely perceive cell-type population shifts as statistically sound effects, even though they were induced by the inherent negative correlations of the cell-type proportions.

Furthermore, the subsampled data does not give us only one valid solution to our question. If both cell types B and C decreased by 1,000 cells in the diseased case, we would obtain the same representative samples of 600 cells as above. To get a unique result, we can fix a reference point for the data, which is assumed to be unchanged throughout all samples [Brill et al., 2019]. This can be a single cell type, an aggregation over multiple cell types such as the geometric mean, or a set of orthogonal bases [Egozcue et al., 2003].
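Choosing the geometric mean of all cell types as the reference corresponds to the centred log-ratio (CLR) transform. A minimal sketch, using the sampled counts from above and a pseudo-count of 0.5 (an arbitrary choice) to guard against zeros:

```python
import numpy as np

def clr(counts: np.ndarray) -> np.ndarray:
    """Centred log-ratio transform of one sample's cell type counts."""
    x = counts + 0.5                 # pseudo-count to guard against zeros
    log_x = np.log(x)
    return log_x - log_x.mean()      # geometric mean acts as the reference

# The sampled counts from the example above.
print(clr(np.array([193, 201, 206])))  # healthy sample
print(clr(np.array([296, 146, 158])))  # diseased sample
```

Relative to this reference, the increase of cell type A and the stability of B and C become visible again, which is the intuition behind reference-based compositional models.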
While single-cell datasets of sufficient size and replicate number have only been around for a few years, the same statistical property has long been discussed in the context of microbiome analysis [Gloor et al., 2017]. There, popular approaches include ANCOM-BC [Lin and Peddada, 2020] and ALDEx2 [Fernandes et al., 2014]. However, these approaches often struggle with single-cell datasets due to the small number of experimental replicates. This issue has been tackled by scCODA [Büttner et al., 2021], which we are going to introduce and apply to our dataset in the following section.

17.4. With labeled clusters#

scCODA belongs to the family of tools that require pre-defined clusters, most commonly cell types, to statistically derive changes in composition. Inspired by methods for compositional analysis of microbiome data, scCODA proposes a Bayesian approach to address the low-replicate issue commonly encountered in single-cell analysis [Büttner et al., 2021]. It models cell-type counts using a hierarchical Dirichlet-Multinomial model, which accounts for uncertainty in cell-type proportions and for the negative correlative bias via joint modeling of all measured cell-type proportions. To ensure a uniquely identifiable solution and easy interpretability, the reference in scCODA is chosen to be a specific cell type. Hence, any compositional changes detected by scCODA always have to be viewed in relation to the selected reference. However, scCODA assumes a log-linear relationship between covariates and cell abundance, which may not always reflect the underlying biological processes when using continuous covariates. A further limitation of scCODA is its inability to infer correlation structures among cell compositions beyond compositional effects. Furthermore, scCODA only models shifts in mean abundance and does not detect changes in response variability [Büttner et al., 2021].

As a first step, we instantiate a scCODA model. Then we use the load function to prepare a MuData object for subsequent processing; it creates a compositional analysis dataset from the input adata. We specify cell_label as the cell_type_identifier, batch as the sample_identifier, and condition as the covariate_obs.

```python
sccoda_model = pt.tl.Sccoda()
sccoda_data = sccoda_model.load(
    adata,
    type="cell_level",
    generate_sample_level=True,
    cell_type_identifier="cell_label",
    sample_identifier="batch",
    covariate_obs=["condition"],
)
sccoda_data
```

```
MuData object with n_obs × n_vars = 9852 × 15223
  2 modalities
    rna: 9842 x 15215
      obs: 'batch', 'barcode', 'condition', 'cell_label', 'scCODA_sample_id'
    coda: 10 x 8
      obs: 'condition', 'batch'
      var: 'n_cells'
```

To get an overview of the cell type distributions across conditions, we can use scCODA's boxplots. To get an even better understanding of how the data is distributed, the red dots show the actual data points.

```python
sccoda_model.plot_boxplots(
    sccoda_data,
    modality_key="coda",
    feature_name="condition",
    figsize=(12, 5),
    add_dots=True,
    args_swarmplot={"palette": ["red"]},
)
plt.show()
```

The boxplots highlight some differences in the distributions of the cell types. Clearly noticeable is the high proportion of enterocytes for the Salmonella condition. But other cell types, such as transit-amplifying (TA) cells, also show stark differences in abundance for the Salmonella condition compared to control.
Whether any of these differences are statistically significant has to be properly evaluated. An alternative visualization is a stacked barplot as provided by scCODA. This visualization nicely displays the characteristics of compositional data: if we compare the Control and Salmonella groups, we can see that the proportion of Enterocytes greatly increases in the infected mice. Since the data is proportional, this leads to a decreased share of all other cell types to fulfill the sum-to-one constraint.

```python
sccoda_model.plot_stacked_barplot(
    sccoda_data, modality_key="coda", feature_name="condition", figsize=(4, 2)
)
plt.show()
```

scCODA requires two major parameters beyond the cell count AnnData object: a formula and a reference cell type. The formula describes the covariates and is specified in R-style formula syntax. In our case, we specify condition as the only covariate. Since it is a discrete covariate with four levels (control and three disease states), this models a comparison of each state with the other samples. If we wanted to model multiple covariates at once, simply adding them in the formula (i.e. formula = "covariate_1 + covariate_2") is enough.

As mentioned above, scCODA requires a reference cell type to compare against, which is believed to be unchanged by the covariates. scCODA can either automatically select an appropriate cell type as reference (a cell type that has nearly constant relative abundance over all samples), or be run with a user-specified reference cell type. Here we set Endocrine cells as the reference, since visually their abundance seems rather constant. An alternative to setting a reference cell type manually is to set reference_cell_type to "automatic", which forces scCODA to select a suitable reference cell type itself. If the choice of reference cell type is unclear, we recommend using this option to get an indicator or even a final selection.

```python
sccoda_data = sccoda_model.prepare(
    sccoda_data,
    modality_key="coda",
    formula="condition",
    reference_cell_type="Endocrine",
)
sccoda_model.run_nuts(sccoda_data, modality_key="coda", rng_key=1234)
```

```
sample: 100%|██████████| 11000/11000 [01:08<00:00, 161.54it/s, 255 steps of size 1.72e-02. acc. prob=0.85]
```

```python
sccoda_data["coda"].varm["effect_df_condition[T.Salmonella]"]
```

```
                       Final Parameter  HDI 3%  HDI 97%     SD  Inclusion probability  Expected Sample  log2-fold change
Cell Type
Endocrine                       0.0000   0.000    0.000  0.000                 0.0000        32.598994         -0.526812
Enterocyte                      1.5458   0.985    2.071  0.283                 0.9996       382.634978          1.703306
Enterocyte.Progenitor           0.0000  -0.475    0.566  0.143                 0.2817       126.126003         -0.526812
Goblet                          0.0000  -0.345    1.013  0.290                 0.4354        52.735108         -0.526812
Stem                            0.0000  -0.742    0.297  0.173                 0.3092       135.406509         -0.526812
TA                              0.0000  -0.876    0.331  0.211                 0.3358        78.986854         -0.526812
TA.Early                        0.0000  -0.338    0.615  0.151                 0.3033       152.670412         -0.526812
Tuft                            0.0000  -1.221    0.548  0.342                 0.4098        23.041143         -0.526812
```

The acceptance rate describes the fraction of proposed samples that are accepted after the initial burn-in phase, and can serve as an ad-hoc indicator of a bad optimization run. In the case of scCODA, the desired acceptance rate is between 0.4 and 0.9. Acceptance rates that are much higher or lower indicate issues with the sampling process.
```python
sccoda_data
```

```
MuData object with n_obs × n_vars = 9852 × 15223
  2 modalities
    rna: 9842 x 15215
      obs: 'batch', 'barcode', 'condition', 'cell_label', 'scCODA_sample_id'
    coda: 10 x 8
      obs: 'condition', 'batch'
      var: 'n_cells'
      uns: 'scCODA_params'
      obsm: 'covariate_matrix', 'sample_counts'
      varm: 'intercept_df', 'effect_df_condition[T.Hpoly.Day3]', 'effect_df_condition[T.Hpoly.Day10]', 'effect_df_condition[T.Salmonella]'
```

scCODA selects credible effects based on their inclusion probability. The cutoff between credible and non-credible effects depends on the desired false discovery rate (FDR). A smaller FDR value produces more conservative results but might miss some effects, while a larger FDR value selects more effects at the cost of a larger number of false discoveries. The desired FDR level can easily be set after inference via sim_results.set_fdr(); by default, the value is 0.05. Since, depending on the dataset, the FDR can have a major influence on the result, we recommend trying out different FDRs up to 0.2 to get the most prominent effects. In our case, we use a less strict FDR of 0.2.

```python
sccoda_model.set_fdr(sccoda_data, 0.2)
```

To get the binary classification of compositional changes per cell type, we use the credible_effects function of scCODA on the result object. Every cell type labeled "True" is credibly more or less abundant. The fold changes describe whether the cell type is more or less present; hence, we will plot them alongside the binary classification below.

```python
sccoda_model.credible_effects(sccoda_data, modality_key="coda")
```

```
Covariate                 Cell Type
condition[T.Hpoly.Day3]   Endocrine                False
                          Enterocyte               False
                          Enterocyte.Progenitor    False
                          Goblet                   False
                          Stem                     False
                          TA                       False
                          TA.Early                 False
                          Tuft                     False
condition[T.Hpoly.Day10]  Endocrine                False
                          Enterocyte                True
                          Enterocyte.Progenitor    False
                          Goblet                   False
                          Stem                     False
                          TA                       False
                          TA.Early                 False
                          Tuft                      True
condition[T.Salmonella]   Endocrine                False
                          Enterocyte                True
                          Enterocyte.Progenitor    False
                          Goblet                   False
                          Stem                     False
                          TA                       False
                          TA.Early                 False
                          Tuft                     False
Name: Final Parameter, dtype: bool
```

To plot the fold changes together with the binary classification, we can use the plot_effects_barplot function.

```python
sccoda_model.plot_effects_barplot(sccoda_data, "coda", "condition")
plt.show()
```

The plots nicely show the credible effects of the conditions on the cell types. These effects largely agree with the findings of the Haber paper, which used a non-compositional Poisson regression model to derive its findings:

"After Salmonella infection, the frequency of mature enterocytes increased substantially." [Haber et al., 2017]

"Heligmosomoides polygyrus caused an increase in the abundance of goblet and tuft cells." [Haber et al., 2017]

Readers familiar with the original publication may wonder why the model used by Haber et al. found more significant effects than scCODA, for example a decrease in Stem and Transit-Amplifying cells in the case of Salmonella infection [Haber et al., 2017]. To explain this discrepancy, remember that cell count data is compositional, and an increase in the relative abundance of one cell type will therefore lead to a decrease in the relative abundance of all other cell types. Due to the stark increase of Enterocytes in the small intestinal epithelium of Salmonella-infected mice, all other cell types appear to decrease, even though this shift is only caused by the compositional properties of the data. While the original (univariate) Poisson regression model will pick up these likely false positive effects, scCODA is able to account for the compositionality of the data and therefore does not fall into this trap.
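Since non-credible effects are shrunk to a Final Parameter of exactly zero, the credible effects and their fold changes can also be pulled directly out of the effect dataframe; a small sketch using the varm key shown above:

```python
# Effect table for the Salmonella condition (varm key as shown above).
effects = sccoda_data["coda"].varm["effect_df_condition[T.Salmonella]"]

# Non-credible effects have a Final Parameter of exactly zero, so
# filtering on it leaves only the credible cell types.
credible = effects[effects["Final Parameter"] != 0]
print(credible[["Final Parameter", "log2-fold change"]])
```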
17.5. With labeled clusters and hierarchical structure#

In addition to the abundance of each cell type, a typical single-cell dataset also contains information about the similarity of the different cells in the form of a tree-based hierarchical ordering. These hierarchies can either be determined automatically via clustering of the gene expression (which is usually done to discover the clusters of cells that belong to the same cell type), or through biologically informed hierarchies such as cell lineages.

tascCODA is an extension of scCODA that integrates hierarchical information and experimental covariate data into the generative modeling of compositional count data [Ostner et al., 2021]. This is especially beneficial for cell atlassing efforts with increased resolution. At its core, it uses almost the same Dirichlet-Multinomial setup as scCODA, but extends the model such that effects can also be estimated for sets of cell types, which are defined as internal nodes in the tree structure.

```python
import schist

warnings.filterwarnings("ignore")
warnings.simplefilter("ignore")
```

To use tascCODA, we first have to define a hierarchical ordering of the cell types. One possible hierarchical clustering uses the eight cell types and orders them by their similarity (Pearson correlation) in the PCA representation with sc.tl.dendrogram. Since this structure is very simple in our data and will therefore not give us many new insights, we want to have a more complex clustering. One recent method to get such clusters is the schist package [Morelli et al., 2021], which uses a nested stochastic block model that clusters the cell population at different resolution levels. Running the method with standard settings takes some time (15 minutes on our data) and gives us an assignment of each cell to a hierarchical clustering in adata.obs.

First, we need to define a distance measure between the cells through a PCA embedding:

```python
# use logcounts to calculate PCA and neighbors
adata.layers["counts"] = adata.X.copy()
adata.layers["logcounts"] = sc.pp.log1p(adata.layers["counts"]).copy()
adata.X = adata.layers["logcounts"].copy()

sc.pp.neighbors(adata, n_neighbors=10, n_pcs=30, random_state=1234)
sc.tl.umap(adata)
```
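After the schist clustering has been fed into tascCODA, the chapter turns to differential abundance testing without labeled clusters (Section 17.6), where cells are grouped into overlapping KNN neighbourhoods in the style of Milo [Dann et al., 2022]: neighbourhoods are defined on the KNN graph, cells are counted per neighbourhood and sample, and each neighbourhood is tested for differential abundance. A minimal, hypothetical sketch of such a workflow with pertpy is shown below; the method names follow pertpy's Milo wrapper, but treat the exact signatures and defaults as assumptions rather than a definitive recipe:

```python
import pertpy as pt
import scanpy as sc

# Sketch of a Milo-style neighbourhood analysis (method names assumed
# from pertpy's Milo wrapper; adjust to the installed version).
milo = pt.tl.Milo()
mdata = milo.load(adata)  # wraps the AnnData into a MuData object

# Define neighbourhoods on the KNN graph computed above.
milo.make_nhoods(mdata["rna"], prop=0.1)

# Count cells per neighbourhood, per sample (batch).
milo.count_nhoods(mdata, sample_col="batch")

# Test neighbourhoods for differential abundance across conditions.
milo.da_nhoods(mdata, design="~condition")

# Label each neighbourhood with its majority cell type.
milo.annotate_nhoods(mdata, anno_col="cell_label")
```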
First, we need to define a distance measure between the cells through a PCA embedding:

# use logcounts to calculate PCA and neighbors
adata.layers["counts"] = adata.X.copy()
adata.layers["logcounts"] = sc.pp.log1p(adata.layers["counts"]).copy()
adata.X = adata.layers["logcounts"].copy()
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=30, random_state=1234)
sc.tl.umap(adata)

WARNING: You’re trying to run this on 15215 dimensions of .X, if you really want this, set use_rep='X'. Falling back to preprocessing with sc.pp.pca and default params.

Then, we can run schist on the AnnData object, which results in a clustering that is defined through a set of columns "nsbm_level_{i}" in adata.obs:

schist.inference.nested_model(adata, samples=100, random_seed=5678)
adata.obs
                                                      batch  barcode         condition   cell_label             scCODA_sample_id  nsbm_level_0  nsbm_level_1  nsbm_level_2  nsbm_level_3  nsbm_level_4  nsbm_level_5
B1_AAACATACCACAAC_Control_Enterocyte.Progenitor       B1     AAACATACCACAAC  Control     Enterocyte.Progenitor  B1                0             0             0             0             0             0
B1_AAACGCACGAGGAC_Control_Stem                        B1     AAACGCACGAGGAC  Control     Stem                   B1                1             5             3             1             0             0
B1_AAACGCACTAGCCA_Control_Stem                        B1     AAACGCACTAGCCA  Control     Stem                   B1                10            2             2             1             0             0
B1_AAACGCACTGTCCC_Control_Stem                        B1     AAACGCACTGTCCC  Control     Stem                   B1                34            3             3             1             0             0
B1_AAACTTGACCACCT_Control_Enterocyte.Progenitor       B1     AAACTTGACCACCT  Control     Enterocyte.Progenitor  B1                91            35            0             0             0             0
...
B10_TTTCACGACAAGCT_Salmonella_TA                      B10    TTTCACGACAAGCT  Salmonella  TA                     B10               6             5             3             1             0             0
B10_TTTCAGTGAGGCGA_Salmonella_Enterocyte              B10    TTTCAGTGAGGCGA  Salmonella  Enterocyte             B10               142           36            4             1             0             0
B10_TTTCAGTGCGACAT_Salmonella_Stem                    B10    TTTCAGTGCGACAT  Salmonella  Stem                   B10               112           1             1             1             0             0
B10_TTTCAGTGTGACCA_Salmonella_Endocrine               B10    TTTCAGTGTGACCA  Salmonella  Endocrine              B10               146           36            4             1             0             0
B10_TTTCAGTGTTCTCA_Salmonella_Enterocyte.Progenitor   B10    TTTCAGTGTTCTCA  Salmonella  Enterocyte.Progenitor  B10               77            14            6             3             0             0
9842 rows × 11 columns

A UMAP plot nicely shows how the clustering from schist (here on levels 1 and 2) is connected to the cell type assignments. The representation on level 1 of the hierarchy is a strict refinement of the level above, i.e. each cluster from level 2 is split into multiple smaller clusters:

sc.pl.umap(
    adata, color=["nsbm_level_1", "nsbm_level_2", "cell_label"], ncols=3, wspace=0.5
)

Now, we convert our cell-level data to sample-level data and create the tree. We create a tasccoda_model object in the same way as for scCODA, but with the clustering defined by schist and the tree levels. The load function of Tasccoda prepares a MuData object, converts our tree representation into an ete tree structure, and saves it as tasccoda_data['coda'].uns["tree"]. To get some clusters that are not too small, we cut the tree before the last level by leaving out "nsbm_level_0".
tasccoda_model = pt.tl.Tasccoda()
tasccoda_data = tasccoda_model.load(
    adata,
    type="cell_level",
    cell_type_identifier="nsbm_level_1",
    sample_identifier="batch",
    covariate_obs=["condition"],
    levels_orig=["nsbm_level_4", "nsbm_level_3", "nsbm_level_2", "nsbm_level_1"],
    add_level_name=True,
)
tasccoda_data

MuData object with n_obs × n_vars = 9852 × 15256
  2 modalities
    rna: 9842 x 15215
      obs: 'batch', 'barcode', 'condition', 'cell_label', 'scCODA_sample_id', 'nsbm_level_0', 'nsbm_level_1', 'nsbm_level_2', 'nsbm_level_3', 'nsbm_level_4', 'nsbm_level_5'
      uns: 'neighbors', 'umap', 'schist', 'nsbm_level_1_colors', 'nsbm_level_2_colors', 'cell_label_colors'
      obsm: 'X_pca', 'X_umap', 'CM_nsbm_level_0', 'CM_nsbm_level_1', 'CM_nsbm_level_2', 'CM_nsbm_level_3', 'CM_nsbm_level_4', 'CM_nsbm_level_5'
      layers: 'counts', 'logcounts'
      obsp: 'distances', 'connectivities'
    coda: 10 x 41
      obs: 'condition', 'batch'
      var: 'n_cells'
      uns: 'tree'

tasccoda_model.plot_draw_tree(tasccoda_data)

The model setup and execution in tascCODA work analogously to scCODA, and the free parameters for the reference and the formula are the same. Additionally, we can adjust the tree aggregation and model selection via the parameters phi and lambda_1 in the pen_args argument (see [Ostner et al., 2021] for more information). Here, we use an unbiased setting phi=0 and a model selection that is slightly less strict than the default with lambda_1=3.5. We use cluster 18 as our reference, since it is almost identical to the set of Endocrine cells.

tasccoda_model.prepare(
    tasccoda_data,
    modality_key="coda",
    reference_cell_type="18",
    formula="condition",
    pen_args={"phi": 0, "lambda_1": 3.5},
    tree_key="tree",
)

Zero counts encountered in data! Added a pseudocount of 0.5.

MuData object with n_obs × n_vars = 9852 × 15256
  2 modalities
    rna: 9842 x 15215
      obs: 'batch', 'barcode', 'condition', 'cell_label', 'scCODA_sample_id', 'nsbm_level_0', 'nsbm_level_1', 'nsbm_level_2', 'nsbm_level_3', 'nsbm_level_4', 'nsbm_level_5'
      uns: 'neighbors', 'umap', 'schist', 'nsbm_level_1_colors', 'nsbm_level_2_colors', 'cell_label_colors'
      obsm: 'X_pca', 'X_umap', 'CM_nsbm_level_0', 'CM_nsbm_level_1', 'CM_nsbm_level_2', 'CM_nsbm_level_3', 'CM_nsbm_level_4', 'CM_nsbm_level_5'
      layers: 'counts', 'logcounts'
      obsp: 'distances', 'connectivities'
    coda: 10 x 41
      obs: 'condition', 'batch'
      var: 'n_cells'
      uns: 'tree', 'scCODA_params'
      obsm: 'covariate_matrix', 'sample_counts'

tasccoda_model.run_nuts(
    tasccoda_data, modality_key="coda", rng_key=1234, num_samples=10000, num_warmup=1000
)

sample: 100%|██████████| 11000/11000 [04:50<00:00, 37.83it/s, 127 steps of size 3.18e-02. acc.
prob=0.97] tasccoda_model.summary(tasccoda_data, modality_key="coda") Show code cell output Hide code cell output Compositional Analysis summary ┌────────────────────────────────────────────┬────────────────────────────────────────────────────────────────────┐ │ Name │ Value │ ├────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────┤ │ Data │ Data: 10 samples, 41 cell types │ │ Reference cell type │ 18 │ │ Formula │ condition │ └────────────────────────────────────────────┴────────────────────────────────────────────────────────────────────┘ ┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │ Intercepts │ ├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Final Parameter Expected Sample │ │ Cell Type │ │ 0 1.313 53.195 │ │ 1 1.098 42.904 │ │ 2 1.205 47.749 │ │ 3 0.526 24.215 │ │ 4 -0.707 7.057 │ │ 5 0.634 26.976 │ │ 6 -0.432 9.290 │ │ 7 1.038 40.405 │ │ 8 1.276 51.263 │ │ 9 1.345 54.925 │ │ 10 0.625 26.735 │ │ 11 0.817 32.394 │ │ 12 -0.359 9.994 │ │ 13 0.260 18.559 │ │ 14 0.851 33.514 │ │ 15 0.524 24.166 │ │ 16 0.934 36.414 │ │ 17 -0.142 12.416 │ │ 18 0.684 28.360 │ │ 19 0.857 33.716 │ │ 20 0.198 17.443 │ │ 21 0.209 17.636 │ │ 22 -0.159 12.206 │ │ 23 0.913 35.658 │ │ 24 1.190 47.038 │ │ 25 0.057 15.149 │ │ 26 -0.086 13.131 │ │ 27 -0.002 14.281 │ │ 28 0.786 31.405 │ │ 29 -0.589 7.940 │ │ 30 -0.713 7.014 │ │ 31 0.210 17.654 │ │ 32 -0.797 6.449 │ │ 33 -0.806 6.391 │ │ 34 -0.839 6.184 │ │ 35 -0.104 12.897 │ │ 36 1.443 60.580 │ │ 37 0.215 17.742 │ │ 38 -1.062 4.948 │ │ 39 -0.879 5.942 │ │ 40 0.084 15.564 │ └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘ ┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │ Effects │ ├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Effect Expected Sample log2-fold change │ │ Covariate Cell Type │ │ conditionT.Hpoly.Day3 0 0.000 51.027 -0.060 │ │ 1 0.000 41.155 -0.060 │ │ 2 -0.257 35.423 -0.431 │ │ 3 0.439 36.030 0.573 │ │ 4 0.000 6.769 -0.060 │ │ 5 0.439 40.139 0.573 │ │ 6 0.000 8.912 -0.060 │ │ 7 0.000 38.759 -0.060 │ │ 8 0.439 76.276 0.573 │ │ 9 -0.257 40.746 -0.431 │ │ 10 0.000 25.645 -0.060 │ │ 11 0.000 31.073 -0.060 │ │ 12 0.000 9.586 -0.060 │ │ 13 0.000 17.803 -0.060 │ │ 14 0.000 32.148 -0.060 │ │ 15 0.000 23.181 -0.060 │ │ 16 0.000 34.930 -0.060 │ │ 17 0.000 11.910 -0.060 │ │ 18 0.000 27.204 -0.060 │ │ 19 0.000 32.342 -0.060 │ │ 20 0.000 16.733 -0.060 │ │ 21 0.439 26.242 0.573 │ │ 22 0.000 11.709 -0.060 │ │ 23 -0.257 26.453 -0.431 │ │ 24 0.000 45.121 -0.060 │ │ 25 0.000 14.532 -0.060 │ │ 26 0.000 12.596 -0.060 │ │ 27 0.000 13.699 -0.060 │ │ 28 0.000 30.125 -0.060 │ │ 29 0.000 7.617 -0.060 │ │ 30 0.000 6.729 -0.060 │ │ 31 0.000 16.935 -0.060 │ │ 32 -0.257 4.784 -0.431 │ │ 33 0.000 6.131 -0.060 │ │ 34 0.000 5.932 -0.060 │ │ 35 0.000 12.371 -0.060 │ │ 36 0.000 58.111 -0.060 │ │ 37 0.000 17.019 -0.060 │ │ 38 0.000 4.746 -0.060 │ │ 39 0.000 5.699 -0.060 │ │ 40 0.439 23.158 0.573 │ │ conditionT.Hpoly.Day10 0 -1.759 12.539 -2.085 │ │ 1 -0.786 26.759 -0.681 │ │ 2 -1.637 12.716 -1.909 │ │ 3 0.000 33.144 0.453 │ │ 4 0.373 14.025 0.991 │ │ 5 0.000 36.924 0.453 │ │ 6 0.000 12.716 0.453 │ │ 7 0.000 55.305 0.453 │ │ 8 0.000 70.166 0.453 │ │ 9 -1.637 14.627 -1.909 │ │ 10 0.000 36.593 
0.453 │ │ 11 -0.242 34.808 0.104 │ │ 12 -0.242 10.739 0.104 │ │ 13 0.000 25.403 0.453 │ │ 14 -0.242 36.012 0.104 │ │ 15 -0.242 25.968 0.104 │ │ 16 0.000 49.842 0.453 │ │ 17 0.000 16.994 0.453 │ │ 18 0.000 38.817 0.453 │ │ 19 0.000 46.148 0.453 │ │ 20 -0.242 18.744 0.104 │ │ 21 0.000 24.140 0.453 │ │ 22 -0.242 13.116 0.104 │ │ 23 -1.637 9.496 -1.909 │ │ 24 -1.597 13.038 -1.851 │ │ 25 0.000 20.736 0.453 │ │ 26 -0.242 14.110 0.104 │ │ 27 0.000 19.548 0.453 │ │ 28 -0.242 33.746 0.104 │ │ 29 0.000 10.868 0.453 │ │ 30 0.000 9.601 0.453 │ │ 31 0.000 24.164 0.453 │ │ 32 1.217 29.810 2.209 │ │ 33 0.564 15.377 1.267 │ │ 34 1.186 27.712 2.164 │ │ 35 0.000 17.652 0.453 │ │ 36 -1.716 14.907 -2.023 │ │ 37 0.000 24.285 0.453 │ │ 38 0.000 6.772 0.453 │ │ 39 0.000 8.132 0.453 │ │ 40 0.000 21.303 0.453 │ │ conditionT.Salmonella 0 0.000 34.663 -0.618 │ │ 1 0.000 27.957 -0.618 │ │ 2 0.000 31.114 -0.618 │ │ 3 0.000 15.779 -0.618 │ │ 4 0.000 4.598 -0.618 │ │ 5 0.000 17.578 -0.618 │ │ 6 0.000 6.054 -0.618 │ │ 7 0.000 26.329 -0.618 │ │ 8 0.000 33.404 -0.618 │ │ 9 0.213 44.286 -0.311 │ │ 10 0.000 17.421 -0.618 │ │ 11 0.000 21.108 -0.618 │ │ 12 0.000 6.512 -0.618 │ │ 13 0.000 12.094 -0.618 │ │ 14 2.173 191.842 2.517 │ │ 15 1.547 73.971 1.614 │ │ 16 0.000 23.728 -0.618 │ │ 17 0.000 8.090 -0.618 │ │ 18 0.000 18.480 -0.618 │ │ 19 0.000 21.970 -0.618 │ │ 20 0.000 11.367 -0.618 │ │ 21 0.000 11.492 -0.618 │ │ 22 0.000 7.954 -0.618 │ │ 23 0.000 23.235 -0.618 │ │ 24 0.000 30.651 -0.618 │ │ 25 0.000 9.872 -0.618 │ │ 26 1.547 40.192 1.614 │ │ 27 0.000 9.306 -0.618 │ │ 28 1.547 96.127 1.614 │ │ 29 0.000 5.174 -0.618 │ │ 30 0.000 4.571 -0.618 │ │ 31 0.000 11.504 -0.618 │ │ 32 0.000 4.202 -0.618 │ │ 33 0.000 4.165 -0.618 │ │ 34 0.000 4.030 -0.618 │ │ 35 0.000 8.404 -0.618 │ │ 36 0.000 39.475 -0.618 │ │ 37 0.000 11.561 -0.618 │ │ 38 0.000 3.224 -0.618 │ │ 39 0.000 3.872 -0.618 │ │ 40 0.000 10.142 -0.618 │ └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘ ┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │ Nodes │ ├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Covariate=condition[T.Hpoly.Day10]_node │ ├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Final Parameter Is credible │ │ Node │ │ nsbm_level_4_0 0.00 False │ │ nsbm_level_3_2 0.00 False │ │ nsbm_level_3_0 0.00 False │ │ nsbm_level_3_1 0.00 False │ │ nsbm_level_3_3 -0.24 True │ │ nsbm_level_2_8 0.00 False │ │ nsbm_level_2_10 0.00 False │ │ 10 0.00 False │ │ 31 0.00 False │ │ nsbm_level_2_0 0.00 False │ │ nsbm_level_2_7 0.00 False │ │ nsbm_level_2_11 0.00 False │ │ nsbm_level_2_3 0.00 False │ │ nsbm_level_2_2 -1.64 True │ │ nsbm_level_2_13 0.00 False │ │ nsbm_level_2_1 0.00 False │ │ nsbm_level_2_4 0.00 False │ │ nsbm_level_2_6 0.00 False │ │ nsbm_level_2_14 0.00 False │ │ 11 0.00 False │ │ 16 0.00 False │ │ 37 0.00 False │ │ 19 0.00 False │ │ 27 0.00 False │ │ 30 0.00 False │ │ 0 -1.76 True │ │ 35 0.00 False │ │ 17 0.00 False │ │ 4 0.37 True │ │ 25 0.00 False │ │ 13 0.00 False │ │ 29 0.00 False │ │ 38 0.00 False │ │ 5 0.00 False │ │ 3 0.00 False │ │ 8 0.00 False │ │ 40 0.00 False │ │ 21 0.00 False │ │ 2 0.00 False │ │ 23 0.00 False │ │ 9 0.00 False │ │ 32 2.85 True │ │ 6 0.00 False │ │ 34 1.19 True │ │ 7 0.00 False │ │ 1 -0.79 True │ │ 24 -1.60 True │ │ 18 0.00 False │ │ 36 -1.72 True │ │ 33 0.56 True │ │ 
39 0.00 False │ │ 26 0.00 False │ │ 14 0.00 False │ │ 28 0.00 False │ │ 15 0.00 False │ │ 12 0.00 False │ │ 20 0.00 False │ │ 22 0.00 False │ ├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Covariate=condition[T.Hpoly.Day3]_node │ ├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Final Parameter Is credible │ │ Node │ │ nsbm_level_4_0 0.00 False │ │ nsbm_level_3_2 0.00 False │ │ nsbm_level_3_0 0.00 False │ │ nsbm_level_3_1 0.00 False │ │ nsbm_level_3_3 0.00 False │ │ nsbm_level_2_8 0.00 False │ │ nsbm_level_2_10 0.00 False │ │ 10 0.00 False │ │ 31 0.00 False │ │ nsbm_level_2_0 0.00 False │ │ nsbm_level_2_7 0.00 False │ │ nsbm_level_2_11 0.00 False │ │ nsbm_level_2_3 0.44 True │ │ nsbm_level_2_2 -0.26 True │ │ nsbm_level_2_13 0.00 False │ │ nsbm_level_2_1 0.00 False │ │ nsbm_level_2_4 0.00 False │ │ nsbm_level_2_6 0.00 False │ │ nsbm_level_2_14 0.00 False │ │ 11 0.00 False │ │ 16 0.00 False │ │ 37 0.00 False │ │ 19 0.00 False │ │ 27 0.00 False │ │ 30 0.00 False │ │ 0 0.00 False │ │ 35 0.00 False │ │ 17 0.00 False │ │ 4 0.00 False │ │ 25 0.00 False │ │ 13 0.00 False │ │ 29 0.00 False │ │ 38 0.00 False │ │ 5 0.00 False │ │ 3 0.00 False │ │ 8 0.00 False │ │ 40 0.00 False │ │ 21 0.00 False │ │ 2 0.00 False │ │ 23 0.00 False │ │ 9 0.00 False │ │ 32 0.00 False │ │ 6 0.00 False │ │ 34 0.00 False │ │ 7 0.00 False │ │ 1 0.00 False │ │ 24 0.00 False │ │ 18 0.00 False │ │ 36 0.00 False │ │ 33 0.00 False │ │ 39 0.00 False │ │ 26 0.00 False │ │ 14 0.00 False │ │ 28 0.00 False │ │ 15 0.00 False │ │ 12 0.00 False │ │ 20 0.00 False │ │ 22 0.00 False │ ├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Covariate=condition[T.Salmonella]_node │ ├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Final Parameter Is credible │ │ Node │ │ nsbm_level_4_0 0.00 False │ │ nsbm_level_3_2 0.00 False │ │ nsbm_level_3_0 0.00 False │ │ nsbm_level_3_1 0.00 False │ │ nsbm_level_3_3 0.00 False │ │ nsbm_level_2_8 0.00 False │ │ nsbm_level_2_10 0.00 False │ │ 10 0.00 False │ │ 31 0.00 False │ │ nsbm_level_2_0 0.00 False │ │ nsbm_level_2_7 0.00 False │ │ nsbm_level_2_11 0.00 False │ │ nsbm_level_2_3 0.00 False │ │ nsbm_level_2_2 0.00 False │ │ nsbm_level_2_13 0.00 False │ │ nsbm_level_2_1 0.00 False │ │ nsbm_level_2_4 0.00 False │ │ nsbm_level_2_6 1.55 True │ │ nsbm_level_2_14 0.00 False │ │ 11 0.00 False │ │ 16 0.00 False │ │ 37 0.00 False │ │ 19 0.00 False │ │ 27 0.00 False │ │ 30 0.00 False │ │ 0 0.00 False │ │ 35 0.00 False │ │ 17 0.00 False │ │ 4 0.00 False │ │ 25 0.00 False │ │ 13 0.00 False │ │ 29 0.00 False │ │ 38 0.00 False │ │ 5 0.00 False │ │ 3 0.00 False │ │ 8 0.00 False │ │ 40 0.00 False │ │ 21 0.00 False │ │ 2 0.00 False │ │ 23 0.00 False │ │ 9 0.21 True │ │ 32 0.00 False │ │ 6 0.00 False │ │ 34 0.00 False │ │ 7 0.00 False │ │ 1 0.00 False │ │ 24 0.00 False │ │ 18 0.00 False │ │ 36 0.00 False │ │ 33 0.00 False │ │ 39 0.00 False │ │ 26 0.00 False │ │ 14 0.63 True │ │ 28 0.00 False │ │ 15 0.00 False │ │ 12 0.00 False │ │ 20 0.00 False │ │ 22 0.00 False │ └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘ Again, the acceptance probability is right around the desired value of 0.85 for tascCODA, indicating no apparent problems with the optimization. 
The result from tascCODA should first and foremost be interpreted as effects on the nodes of the tree. A nonzero parameter on a node means that the aggregated count of all cell types under that node changes credibly. We can easily visualize this in a tree plot for each of the three disease states. Blue circles indicate an increase, red circles a decrease:

tasccoda_model.plot_draw_effects(
    tasccoda_data,
    modality_key="coda",
    tree="tree",
    covariate="condition[T.Salmonella]",
    show_leaf_effects=False,
    show_legend=False,
)
tasccoda_model.plot_draw_effects(
    tasccoda_data,
    modality_key="coda",
    tree="tree",
    covariate="condition[T.Hpoly.Day3]",
    show_leaf_effects=False,
    show_legend=False,
)
tasccoda_model.plot_draw_effects(
    tasccoda_data,
    modality_key="coda",
    tree="tree",
    covariate="condition[T.Hpoly.Day10]",
    show_leaf_effects=False,
    show_legend=False,
)

Alternatively, effects on internal nodes can also be translated through the tree onto the cell type level, allowing for a calculation of log-fold changes like in scCODA. To visualize the log-fold changes of the cell types, we create the same plots as for scCODA, inspired by "High-resolution single-cell atlas reveals diversity and plasticity of tissue-resident neutrophils in non-small cell lung cancer" [Salcher et al., 2022].

tasccoda_model.plot_effects_barplot(
    tasccoda_data, modality_key="coda", covariates="condition"
)

Another insightful representation can be gained by plotting the effect sizes for each condition on the UMAP embedding and comparing them to the cell type assignments:

kwargs = {"ncols": 3, "wspace": 0.25, "vcenter": 0, "vmax": 1.5, "vmin": -1.5}
tasccoda_model.plot_effects_umap(
    tasccoda_data,
    effect_name=[
        "effect_df_condition[T.Salmonella]",
        "effect_df_condition[T.Hpoly.Day3]",
        "effect_df_condition[T.Hpoly.Day10]",
    ],
    cluster_key="nsbm_level_1",
    **kwargs,
)
sc.pl.umap(
    tasccoda_data["rna"], color=["cell_label", "nsbm_level_1"], ncols=2, wspace=0.5
)

The results are very similar to scCODA's findings:

- For the Salmonella infection, we get an aggregated increase in clusters that approximately represent Enterocytes in the cell type clustering. This increase is even stronger for cluster 12, as indicated by the additional positive effect on the leaf level.
- For the Heligmosomoides polygyrus infection, we get no credible changes after 3 days. After 10 days, we pick up decreases in clusters that contain Stem and transit-amplifying cells, as well as a weaker decrease of Enterocytes and Enterocyte progenitors, which was also picked up by scCODA.

17.6. Without labeled clusters#

It is not always possible or practical to use precisely labeled clusters such as cell-type definitions, especially when we are interested in studying transitional states between cell type clusters, such as during developmental processes, or when we expect only a subpopulation of a cell type to be affected by the condition of interest. In such cases, determining compositional changes based on known annotations may not be appropriate. A set of methods exists to detect compositional changes occurring in subpopulations of cells smaller than the cell type clusters, usually defined starting from a k-nearest neighbor (KNN) graph computed from similarities in the same low-dimensional space used for clustering. DA-seq computes, for each cell, a score based on the relative prevalence of cells from both biological states in the cell's neighborhood, using a range of k values [Zhao et al., 2021].
The scores are used as input for a logistic classifier to predict the biological condition of each cell. Milo assigns cells to partially overlapping neighborhoods on the KNN graph, and then differential abundance (DA) testing is performed by modelling cell counts with a generalized linear model (GLM) [Dann et al., 2022]. MELD calculates a relative likelihood estimate of observing each cell in every condition using a graph-based density estimate [Burkhardt et al., 2021].

These methods have unique strengths and weaknesses. Because it relies on logistic classification, DA-seq is designed for pairwise comparisons between two biological conditions and can't be applied to test for differences associated with a continuous covariate (such as age or timepoints). DA-seq and Milo use the variance in the abundance statistic between replicate samples of the same condition to estimate the significance of the differential abundance, while MELD doesn't use this information. While considering consistency across replicates reduces the number of false positives driven by one or a few samples, all KNN-based methods are sensitive to a loss of information if the conditions of interest and confounders, defined by technical or experimental sources of variation, are strongly correlated. The impact of confounders can be mitigated by using batch integration methods before KNN graph construction and/or by incorporating the confounding covariates in the model for DA testing, as we discuss further in the example below. Another limitation of KNN-based methods to bear in mind is that cells in a neighborhood may not necessarily represent a specific, unique biological subpopulation, because a cellular state may span multiple neighborhoods. Reducing k for the KNN graph or constructing a graph on cells from a particular lineage of interest can help mitigate this issue and ensure the predicted effects are robust to the choice of parameters and to the data subset used [Dann et al., 2022].

Generally, if large differences are apparent in large clusters by visualization, or if the imbalances between cell types are of interest, direct analysis with cell-type-aware methods, such as scCODA, might be more suitable. KNN-based methods are more powerful when we are interested in differences in cell abundances that might appear in transitional states between cell types or in a specific subset of cells of a given cell type.

We will now apply Milo to the Haber dataset to try to find over- or underrepresented neighborhoods of cells upon infection. Milo is available as miloR for R users and in Pertpy for Python users in the scverse ecosystem. In the following demonstration, we will use milo, which is easiest to use with our AnnData object due to its scverse compatibility. Be aware that milo in its current state also requires a working edgeR installation.

To perform DA analysis with Milo, we need to construct a KNN graph that is representative of the biological similarities between cells, as we do when performing clustering or UMAP visualization of a single-cell dataset. This means (A) building a common low-dimensional space for all samples and (B) minimizing cell-cell similarities driven by technical factors (i.e. batch effects). We first use the standard scanpy workflow for dimensionality reduction to qualitatively assess whether we see a batch effect in this dataset.
milo = pt.tl.Milo()
adata = pt.dt.haber_2017_regions()
mdata = milo.load(adata)
mdata

MuData object with n_obs × n_vars = 9842 × 15215
  2 modalities
    rna: 9842 x 15215
      obs: 'batch', 'barcode', 'condition', 'cell_label'
    milo: 0 x 0

# use logcounts to calculate PCA and neighbors
adata.layers["counts"] = adata.X.copy()
adata.layers["logcounts"] = sc.pp.log1p(adata.layers["counts"]).copy()
adata.X = adata.layers["logcounts"].copy()
sc.pp.highly_variable_genes(
    adata, n_top_genes=3000, subset=False
)  # 3k genes as used by authors for clustering
sc.pp.pca(adata)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=30)
sc.tl.umap(adata)

sc.pl.umap(adata, color=["condition", "batch", "cell_label"], ncols=3, wspace=0.25)

While cell type clusters are broadly captured, we can see residual separation between batches, even for replicates of the same treatment. If we define neighbourhoods on this KNN graph, we might get a large fraction of neighbourhoods containing cells from just one or a few batches. This could introduce false negatives, if the variance in the number of cells between replicates is too low (e.g. 0 cells for all replicates) or too high (e.g. zero cells for all replicates except one with a large number of cells), but also false positives, especially when, as in this case, the number of replicates per condition is low. To minimize these errors, we apply the scVI method to learn a batch-corrected latent space, as introduced in the integration chapter.

import scvi

adata_scvi = adata[:, adata.var["highly_variable"]].copy()
scvi.model.SCVI.setup_anndata(adata_scvi, layer="counts", batch_key="batch")
model_scvi = scvi.model.SCVI(adata_scvi)
max_epochs_scvi = int(np.min([round((20000 / adata.n_obs) * 400), 400]))
model_scvi.train(max_epochs=max_epochs_scvi)
adata.obsm["X_scVI"] = model_scvi.get_latent_representation()

sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.umap(adata)
sc.pl.umap(adata, color=["condition", "batch", "cell_label"], ncols=3, wspace=0.25)

Here we can see much better mixing between batches, and cell labels form much more uniform clusters. Will batch correction remove biological differences between conditions? This really boils down to good experimental design. In an ideal set-up, replicates from the same condition will be processed in different batches. This allows us to estimate technical differences more accurately and possibly also to incorporate the batch as a confounder in the linear model for differential abundance analysis (see below), to further minimize false positives. When, as in this example, batches are confounded with the biological condition of interest, we have to recognize that while we might be minimizing false positives, the rate of false negatives might also increase. The analyst can decide which type of error is more detrimental depending on the dataset and the purpose of the differential abundance analysis.

17.6.1. Define neighbourhoods#

Milo is a KNN-based model, where cell abundance is quantified on neighbourhoods of cells. In Milo, a neighbourhood is defined as the group of cells connected by an edge to the same cell (the index cell) in an undirected KNN graph. While we could in principle have one neighbourhood for each cell in the graph, this would be inefficient and would significantly increase the multiple testing burden. Therefore, Milo samples a refined set of cells as index cells for neighbourhoods, starting from a random sample of a fraction of cells.
The initial proportion can be specified using the prop argument in the milo.make_nhoods function. By default, we recommend using prop=0.1 (10% of cells), and reducing it to 5% or 2% to increase scalability on large datasets (> 100k cells). If no neighbors_key parameter is specified, Milo uses the neighbours from .obsp. Therefore, ensure that sc.pp.neighbors was run on the correct representation, i.e. an integrated latent space if batch correction was required.

milo.make_nhoods(mdata, prop=0.1)

Now the binary assignment of cells to neighbourhoods is stored in adata.obsm['nhoods']. As expected, the number of neighbourhoods is at most the number of cells in the graph times the prop parameter, in this case at most 984 neighbourhoods.

adata.obsm["nhoods"]

<9842x847 sparse matrix of type '<class 'numpy.float32'>' with 22864 stored elements in Compressed Sparse Row format>

At this point, we need to check the median number of cells in each neighbourhood, to make sure the neighbourhoods contain enough cells to detect differences between samples.

nhood_size = adata.obsm["nhoods"].toarray().sum(0)
plt.hist(nhood_size, bins=20)
plt.xlabel("# cells in neighbourhood")
plt.ylabel("# neighbourhoods")

np.median(nhood_size)

26.0

We expect the minimum number of cells to be equal to the k parameter used during graph construction (k=10 in this case). To increase the statistical power for DA testing, we need a sufficient number of cells from all samples in the majority of the neighbourhoods. We can use the following rule of thumb: to have a median of 3 cells from each sample in a neighbourhood, the number of cells in a neighbourhood should be at least 3 times the number of samples (a quick programmatic check follows at the end of this subsection). In this case, we have data from 10 samples, so if we want on average 3 cells from each sample per neighbourhood, the minimum number of cells should be 30. Based on the plot above, we have a large number of neighbourhoods with fewer than 30 cells, which could lead to an underpowered test. To solve this, we simply recompute the KNN graph using n_neighbors=30. To distinguish this KNN graph used for neighbourhood-level DA analysis from the graph used for UMAP building, we store it as a distinct graph in adata.obsp.

sc.pp.neighbors(adata, n_neighbors=30, use_rep="X_scVI", key_added="milo")
milo.make_nhoods(mdata, neighbors_key="milo", prop=0.1)

Let's check that the distribution of neighbourhood sizes has shifted.

nhood_size = adata.obsm["nhoods"].toarray().sum(0)
plt.hist(nhood_size, bins=20)
plt.xlabel("# cells in neighbourhood")
plt.ylabel("# neighbourhoods")
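The rule of thumb above is easy to check programmatically. A small sketch using only objects created in this section (the 3-cells-per-sample target mirrors the rule of thumb; the variable names are our own):

# Rule of thumb: a neighbourhood should hold ~3 cells from each sample.
n_samples = adata.obs["batch"].nunique()  # 10 samples in the Haber data
min_cells = 3 * n_samples  # -> 30
nhood_size = np.asarray(adata.obsm["nhoods"].sum(axis=0)).ravel()
frac_small = float((nhood_size < min_cells).mean())
print(f"{frac_small:.1%} of neighbourhoods hold fewer than {min_cells} cells")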
17.6.2. Count cells in neighbourhoods#

In the next step, Milo counts the cells belonging to each of the samples (here identified by the batch column in adata.obs).

milo.count_nhoods(mdata, sample_col="batch")

MuData object with n_obs × n_vars = 9842 × 15215
  2 modalities
    rna: 9842 x 15215
      obs: 'batch', 'barcode', 'condition', 'cell_label', 'nhood_ixs_random', 'nhood_ixs_refined', 'nhood_kth_distance'
      var: 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
      uns: 'hvg', 'pca', 'neighbors', 'umap', 'condition_colors', 'batch_colors', 'cell_label_colors', 'nhood_neighbors_key', 'milo'
      obsm: 'X_pca', 'X_umap', 'X_scVI', 'nhoods'
      varm: 'PCs'
      layers: 'counts', 'logcounts'
      obsp: 'distances', 'connectivities', 'milo_distances', 'milo_connectivities'
    milo_compositional: 10 x 808
      var: 'index_cell', 'kth_distance'
      uns: 'sample_col'

This stores a neighbourhood-level AnnData object, where nhood_adata.X stores the number of cells from each sample in each neighbourhood.

mdata["milo"]

AnnData object with n_obs × n_vars = 10 × 808
    var: 'index_cell', 'kth_distance'
    uns: 'sample_col'

We can verify that the average number of cells per sample times the number of samples roughly corresponds to the number of cells in a neighbourhood.

mean_n_cells = mdata["milo"].X.toarray().mean(0)
plt.plot(nhood_size, mean_n_cells, ".")
plt.xlabel("# cells in nhood")
plt.ylabel("Mean # cells per sample in nhood")

17.6.3. Run differential abundance test on neighbourhoods#

Milo uses edgeR's QLF test to check whether there are statistically significant differences in the number of cells from a condition of interest in each neighbourhood. Here we are interested in detecting in which neighbourhoods there is a significant increase or decrease of cells in response to infection. Since the condition covariate stores several types of infection, we need to specify which conditions to contrast in our differential abundance test (following a convention used in R, by default the last level of the covariate is tested against the rest, in this case Salmonella vs rest). To specify the comparison, we use the syntax used for GLMs in R. Let's first test for differences associated with Salmonella infection.

milo.da_nhoods(
    mdata, design="~condition", model_contrasts="conditionSalmonella-conditionControl"
)
milo_results_salmonella = mdata["milo"].obs.copy()
milo_results_salmonella

     condition    batch
B1   Control      B1
B2   Control      B2
B3   Control      B3
B4   Control      B4
B5   Hpoly.Day3   B5
B6   Hpoly.Day3   B6
B7   Hpoly.Day10  B7
B8   Hpoly.Day10  B8
B9   Salmonella   B9
B10  Salmonella   B10

For each neighbourhood, we calculate a set of statistics. The most important ones to understand are:

- log-Fold Change (logFC): this represents the effect size of the difference in cell abundance and corresponds to the coefficient associated with the condition of interest in the GLM. If logFC > 0, the neighbourhood is enriched in cells from the condition of interest; if logFC < 0, the neighbourhood is depleted.
- Uncorrected p-value (PValue): this is the p-value of the QLF test before multiple testing correction.
- SpatialFDR: this is the p-value adjusted for multiple testing to limit the false discovery rate. It is calculated by adapting the weighted Benjamini-Hochberg (BH) correction introduced by Lun et al. [Lun et al., 2017], which accounts for the fact that, because neighbourhoods partially overlap (i.e. one cell can belong to multiple neighbourhoods), the DA tests on different neighbourhoods are not completely independent. In practice, the BH correction is weighted by the reciprocal of the distance to the k-th nearest neighbor of the index cell (stored in kth_distance), which is used as a proxy for the amount of overlap with other neighbourhoods. You might notice that the SpatialFDR values here are always lower than or equal to the FDR values calculated with a conventional BH correction. A schematic implementation of this weighting follows below.
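To make the weighting concrete, here is a schematic numpy implementation of a weighted BH adjustment with weights set to the reciprocal of kth_distance, as described above. This illustrates the principle only and is not the exact pertpy/miloR code:

import numpy as np

def weighted_bh(pvals, weights):
    # Weighted Benjamini-Hochberg: the cumulative weight takes the
    # place of the rank used in the standard BH procedure.
    order = np.argsort(pvals)
    p, w = pvals[order], weights[order]
    adj = p * w.sum() / np.cumsum(w)
    # Enforce monotonicity from the largest p-value downwards
    adj = np.minimum.accumulate(adj[::-1])[::-1]
    out = np.empty_like(adj)
    out[order] = np.clip(adj, 0, 1)
    return out

# Schematic usage on Milo's outputs: denser neighbourhoods
# (smaller kth_distance) receive larger weights.
res = mdata["milo"].var
spatial_fdr = weighted_bh(res["PValue"].to_numpy(), 1 / res["kth_distance"].to_numpy())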
Before any exploration and interpretation of the results, we can visualize these statistics with a set of diagnostic plots to sanity-check our statistical test:

def plot_milo_diagnostics(mdata):
    alpha = 0.1  ## significance threshold
    with matplotlib.rc_context({"figure.figsize": [12, 12]}):
        ## Check P-value histogram
        plt.subplot(2, 2, 1)
        plt.hist(mdata["milo"].var["PValue"], bins=20)
        plt.xlabel("Uncorrected P-value")

        ## Visualize extent of multiple-testing correction
        plt.subplot(2, 2, 2)
        plt.scatter(
            mdata["milo"].var["PValue"],
            mdata["milo"].var["SpatialFDR"],
            s=3,
        )
        plt.xlabel("Uncorrected P-value")
        plt.ylabel("SpatialFDR")

        ## Visualize volcano plot
        plt.subplot(2, 2, 3)
        plt.scatter(
            mdata["milo"].var["logFC"],
            -np.log10(mdata["milo"].var["SpatialFDR"]),
            s=3,
        )
        plt.axhline(
            y=-np.log10(alpha),
            color="red",
            linewidth=1,
            label=f"{int(alpha * 100)} % SpatialFDR",
        )
        plt.legend()
        plt.xlabel("log-Fold Change")
        plt.ylabel("- log10(SpatialFDR)")
        plt.tight_layout()

        ## Visualize MA plot
        df = mdata["milo"].var
        emp_null = df[df["SpatialFDR"] >= alpha]["logFC"].mean()
        df["Sig"] = df["SpatialFDR"] < alpha

        plt.subplot(2, 2, 4)
        sns.scatterplot(data=df, x="logCPM", y="logFC", hue="Sig")
        plt.axhline(y=0, color="grey", linewidth=1)
        plt.axhline(y=emp_null, color="purple", linewidth=1)
        plt.legend(title=f"< {int(alpha * 100)} % SpatialFDR")
        plt.xlabel("Mean log-counts")
        plt.ylabel("log-Fold Change")
        plt.show()

plot_milo_diagnostics(mdata)

- The P-value histogram shows the distribution of p-values before multiple testing correction. By definition, we expect the p-values under the null hypothesis (> significance level) to be uniformly distributed, while the peak of p-values close to zero represents the significant results. This gives you an idea of how conservative your test is and might help you spot pathological cases early (a toy simulation of a healthy histogram follows after these points). For example, if the distribution of p-values looks bimodal, with a second peak close to 1, this might indicate that you have a large number of neighbourhoods with no variance between the replicates of one condition (e.g. all replicates from one condition have 0 cells), which might point to a residual batch effect or to the need to increase the size of neighbourhoods; if the p-value histogram is left-skewed, this might indicate that a confounding covariate has not been accounted for in the model. For other pathological cases and possible interpretations, see this blogpost.
- For each neighbourhood, we plot the uncorrected p-value vs the p-value controlling the spatial FDR. Here we expect the adjusted p-values to be larger (so points above the diagonal). If the FDR correction is especially severe (i.e. many values close to 1), this might indicate a pathological case: you might be testing too many neighbourhoods (you can reduce prop in milo.make_nhoods), or there might be too much overlap between neighbourhoods (you might need to decrease k when constructing the KNN graph).
- The volcano plot gives us an idea of how many neighbourhoods show significant DA after multiple testing correction (-log10(SpatialFDR) > 1) and how many neighbourhoods are enriched or depleted of cells from the condition of interest.
- The MA plot shows the dependency between the average number of cells per sample and the log-fold change of the test. In a balanced scenario, we expect points to be concentrated around logFC = 0; otherwise, the shift might indicate a strong imbalance in the average number of cells between samples from different conditions. For more tips on how to interpret the MA plot, see MarioniLab/miloR#208.
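For intuition about the p-value histogram diagnostic described in the first point above, it can help to simulate what a healthy test looks like. A toy sketch, independent of the Milo workflow (all numbers are arbitrary):

import matplotlib.pyplot as plt
import numpy as np

rng_toy = np.random.default_rng(0)
# Under the null hypothesis p-values are uniform on [0, 1]; true signal
# adds a spike of small p-values on top of that flat floor.
null_p = rng_toy.uniform(size=700)
signal_p = rng_toy.beta(a=0.5, b=20.0, size=100)  # concentrated near 0
plt.hist(np.concatenate([null_p, signal_p]), bins=20)
plt.xlabel("P-value")
plt.show()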
After these sanity checks, we can visualize the DA results for each neighbourhood by the position of the index cell on the UMAP embedding, to qualitatively assess which cell types may be most affected by the infection.

milo.build_nhood_graph(mdata)

with matplotlib.rc_context({"figure.figsize": [10, 10]}):
    milo.plot_nhood_graph(mdata, alpha=0.1, min_size=5, plot_edges=False)
    sc.pl.umap(mdata["rna"], color="cell_label", legend_loc="on data")

This shows a set of neighbourhoods enriched upon Salmonella infection corresponding to mature enterocytes, and a depletion in a subset of stem cell neighbourhoods. For the interpretation of results, it is often useful to annotate neighbourhoods by the cell type cluster that they overlap with.

milo.annotate_nhoods(mdata, anno_col="cell_label")

# Define as mixed if fraction of cells in nhood with same label is lower than 0.75
mdata["milo"].var.loc[
    mdata["milo"].var["nhood_annotation_frac"] < 0.75, "nhood_annotation"
] = "Mixed"

milo.plot_da_beeswarm(mdata)
plt.show()

What about the compositional effect? Comparing the Milo results to the scCODA results, we identify a strong enrichment upon Salmonella infection in the Enterocytes, but also a depletion in a subset of Stem cells, similar to what the original authors reported [Haber et al., 2017]. Although we don't have a ground truth to verify whether the decrease in the abundance of stem cells is real, it is important to note that the GLM in Milo does not explicitly model the compositional nature of cell abundances in neighbourhoods, so in theory the results could be affected by compositional biases. In practice, performing a test on a large number of neighbourhoods alleviates this issue, since the effect in the opposite direction is distributed across thousands of neighbourhoods rather than tens of cell types. Additionally, the test used in Milo applies the trimmed mean of M values normalization method (TMM normalization [Robinson and Oshlack, 2010]) to estimate normalization factors that are robust to compositional differences across samples. In this particular example, residual compositional effects might be explained by (A) the relatively small number of neighbourhoods (< 1000), (B) the very large effect size in Enterocyte neighbourhoods, or (C) the very small number of replicates per condition.

Of note, the GLM framework used by Milo also allows testing for cell enrichment or depletion with continuous covariates. We demonstrate this by testing for differential abundance along the Heligmosomoides polygyrus infection time course.
## Turn into continuous variable
mdata["rna"].obs["Hpoly_timecourse"] = (
    mdata["rna"]
    .obs["condition"]
    .cat.reorder_categories(["Salmonella", "Control", "Hpoly.Day3", "Hpoly.Day10"])
)
mdata["rna"].obs["Hpoly_timecourse"] = mdata["rna"].obs["Hpoly_timecourse"].cat.codes

## Here we exclude the Salmonella samples
test_samples = (
    mdata["rna"]
    .obs.batch[mdata["rna"].obs.condition != "Salmonella"]
    .astype("str")
    .unique()
)
milo.da_nhoods(mdata, design="~Hpoly_timecourse", subset_samples=test_samples)
plot_milo_diagnostics(mdata)

with matplotlib.rc_context({"figure.figsize": [10, 10]}):
    milo.plot_nhood_graph(mdata, alpha=0.1, min_size=5, plot_edges=False)

milo.plot_da_beeswarm(mdata)
plt.show()

We can verify that the test captures a linear increase in cell numbers across the time course by plotting the number of cells per sample by condition in neighbourhoods where significant enrichment or depletion was detected.

entero_ixs = mdata["milo"].var_names[
    (mdata["milo"].var["SpatialFDR"] < 0.1)
    & (mdata["milo"].var["logFC"] < 0)
    & (mdata["milo"].var["nhood_annotation"] == "Enterocyte")
]
plt.title("Enterocyte")
milo.plot_nhood_counts_by_cond(
    mdata, test_var="Hpoly_timecourse", subset_nhoods=entero_ixs
)
plt.show()

tuft_ixs = mdata["milo"].var_names[
    (mdata["milo"].var["SpatialFDR"] < 0.1)
    & (mdata["milo"].var["logFC"] > 0)
    & (mdata["milo"].var["nhood_annotation"] == "Tuft")
]
plt.title("Tuft cells")
milo.plot_nhood_counts_by_cond(
    mdata, test_var="Hpoly_timecourse", subset_nhoods=tuft_ixs
)
plt.show()

Interestingly, the DA test on the neighbourhoods detects an enrichment upon infection in Tuft cells and in a subset of Goblet cells. We can characterize the difference between the cell type subpopulations enriched upon infection by examining the mean gene expression profiles of cells in neighbourhoods. For example, if we take the neighbourhoods of Goblet cells, we can see that neighbourhoods enriched upon infection display a higher expression of Retnlb, a gene implicated in anti-parasitic immunity [Haber et al., 2017].

## Compute average Retnlb expression per neighbourhood
# (you can add the mean expression of all genes using milo.utils.add_nhood_expression)
mdata["rna"].obs["Retnlb_expression"] = (
    mdata["rna"][:, "Retnlb"].layers["logcounts"].toarray().ravel()
)
milo.annotate_nhoods_continuous(mdata, "Retnlb_expression")

## Subset to Goblet cell neighbourhoods
nhood_df = mdata["milo"].var.copy()
nhood_df = nhood_df[nhood_df["nhood_annotation"] == "Goblet"]

sns.scatterplot(data=nhood_df, x="logFC", y="nhood_Retnlb_expression")
plt.show()

Accounting for confounding covariates: several confounding factors other than our condition of interest might affect cell abundances and proportions. For example, different sets of samples might have been processed or sequenced in different batches, or a set of samples could contain cells FACS-sorted with different markers to enrich a subset of populations of interest. As long as these factors are not completely correlated with the condition of interest, we can include them as covariates in the model used for differential abundance testing, to estimate the differential abundance associated with the condition of interest while minimizing differences explained by the confounding factors. In Milo, we can express this type of test design using the syntax ~ confounder + condition.
## Make a dummy confounder for the sake of this example
rng = np.random.default_rng()  # random number generator, assumed to be defined earlier
nhood_adata = mdata["milo"].copy()
conf_dict = dict(
    zip(
        nhood_adata.obs_names,
        rng.choice(["group1", "group2"], nhood_adata.n_obs),
    )
)
mdata["rna"].obs["dummy_confounder"] = [conf_dict[x] for x in mdata["rna"].obs["batch"]]
milo.da_nhoods(mdata, design="~dummy_confounder+condition")
mdata["milo"].var

     index_cell                                        kth_distance  SpatialFDR  Sig    Nhood_size  nhood_annotation       nhood_annotation_frac  nhood_Retnlb_expression  logFC      logCPM     F         PValue    FDR
0    B1_AAAGGCCTAAGGCG_Control_Stem                    1.304513      0.056105    False  53.0        Stem                   0.830189               0.033807                 -3.696119  10.690300  9.083460  0.002595  0.066143
1    B1_AACACGTGATGCTG_Control_TA.Early                1.335187      0.938353    False  67.0        Mixed                  0.477612               0.020691                 -0.248959  10.926409  0.078934  0.778762  0.941976
2    B1_AACTTGCTGGTATC_Control_Enterocyte              1.519376      0.693653    False  49.0        Enterocyte             1.000000               0.000000                 0.951949   10.727155  1.416844  0.233993  0.720537
3    B1_AAGAACGATGACTG_Control_Enterocyte              2.143153      0.746709    False  39.0        Enterocyte             1.000000               0.000000                 0.879631   10.467676  1.109896  0.292168  0.769131
4    B1_AATTACGAAACAGA_Control_Enterocyte.Progenitor   1.600587      0.436180    False  37.0        Enterocyte.Progenitor  0.945946               0.000000                 -2.440105  10.296494  3.202427  0.073604  0.478744
...
803  B10_TTAGGTCTAGACTC_Salmonella_Goblet              1.547428      0.400895    False  48.0        Goblet                 1.000000               0.126426                 1.931002   10.566431  3.591423  0.058150  0.443255
804  B10_TTAGTCACCATGGT_Salmonella_TA.Early            1.348982      0.880493    False  64.0        TA.Early               0.875000               0.010830                 -0.553709  10.876626  0.342947  0.558166  0.886244
805  B10_TTATGGCTTAACGC_Salmonella_TA.Early            1.357123      0.981884    False  51.0        TA.Early               0.862745               0.000000                 0.071987   10.649724  0.007052  0.933080  0.982928
806  B10_TTCATCGACCGTAA_Salmonella_TA.Early            1.313244      0.908718    False  66.0        TA.Early               0.787879               0.021004                 -0.403998  10.825716  0.176318  0.674579  0.916060
807  B10_TTGAACCTCATTTC_Salmonella_TA.Early            1.333960      0.787177    False  55.0        Mixed                  0.454545               0.012603                 0.774925   10.712952  0.811296  0.367791  0.805382
808 rows × 13 columns

17.7. Key Takeaways#

- If the primary interest lies in compositional changes among known cell types or states, use scCODA or tascCODA to statistically evaluate changes in abundance.
- KNN-based methods like DA-seq or Milo should be used if the data does not cluster distinctly, such as during developmental processes, or if we are interested in differences in cell abundances that might appear in transitional states between cell types or in a specific subset of cells of a given cell type.

17.8. Quiz#

1. It is tricky to deduce compositional changes visually. Why?
2. Why is it necessary to interpret cell type abundances as proportions instead of absolute counts? What are the dangers of not doing so?
3. In which cases should tools that use cluster information, such as cell types, be used, and in which cases tools that do not use cluster information?

17.9. References#

[compAit82] J. Aitchison. The statistical analysis of compositional data. Journal of the Royal Statistical Society: Series B (Methodological), 44(2):139–160, 1982. doi:10.1111/j.2517-6161.1982.tb01195.x.

[compBAH19] Barak Brill, Amnon Amir, and Ruth Heller. Testing for differential abundance in compositional counts data, with application to microbiome studies. arXiv, April 2019. arXiv:1904.08937.

[compBST+21] Daniel B. Burkhardt, Jay S. Stanley, Alexander Tong, Ana Luisa Perdigoto, Scott A. Gigante, Kevan C. Herold, Guy Wolf, Antonio J. Giraldez, David van Dijk, and Smita Krishnaswamy. Quantifying the effect of experimental perturbations at single-cell resolution. Nature Biotechnology, 39(5):619–629, May 2021. doi:10.1038/s41587-020-00803-5.

[compButtnerOMuller+21] M. Büttner, J. Ostner, C. L. Müller, F. J. Theis, and B. Schubert. scCODA is a Bayesian model for compositional single-cell data analysis. Nature Communications, 12(1):6876, Nov 2021. doi:10.1038/s41467-021-27150-6.

[compCLO+19] Yue Cao, Yingxin Lin, John T. Ormerod, Pengyi Yang, Jean Y.H. Yang, and Kitty K. Lo. scDC: single cell differential composition analysis. BMC Bioinformatics, 20(19):721, Dec 2019. doi:10.1186/s12859-019-3211-9.

[compDHT+22] Emma Dann, Neil C. Henderson, Sarah A. Teichmann, Michael D. Morgan, and John C. Marioni. Differential abundance testing on single-cell data using k-nearest neighbor graphs. Nature Biotechnology, 40(2):245–253, Feb 2022. doi:10.1038/s41587-021-01033-z.

[compEPGMFBarceloV03] J. J. Egozcue, V. Pawlowsky-Glahn, G. Mateu-Figueras, and C. Barceló-Vidal. Isometric logratio transformations for compositional data analysis. Mathematical Geology, 35(3):279–300, April 2003. doi:10.1023/A:1023818214614.

[compFRM+14] Andrew D. Fernandes, Jennifer NS Reid, Jean M. Macklaim, Thomas A. McMurrough, David R. Edgell, and Gregory B. Gloor. Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome, 2:15, May 2014. doi:10.1186/2049-2618-2-15.

[compGMPGE17] Gregory B. Gloor, Jean M. Macklaim, Vera Pawlowsky-Glahn, and Juan J. Egozcue. Microbiome datasets are compositional: and this is not optional. Frontiers in Microbiology, 8:2224, November 2017. doi:10.3389/fmicb.2017.02224.

[compHBR+17] Adam L. Haber, Moshe Biton, Noga Rogel, Rebecca H. Herbst, Karthik Shekhar, Christopher Smillie, Grace Burgin, Toni M. Delorey, Michael R. Howitt, Yarden Katz, Itay Tirosh, Semir Beyaz, Danielle Dionne, Mei Zhang, Raktima Raychowdhury, Wendy S. Garrett, Orit Rozenblatt-Rosen, Hai Ning Shi, Omer Yilmaz, Ramnik J. Xavier, and Aviv Regev. A single-cell survey of the small intestinal epithelium. Nature, 551(7680):333–339, Nov 2017. doi:10.1038/nature24489.

[compLP20] Huang Lin and Shyamal Das Peddada. Analysis of compositions of microbiomes with bias correction. Nature Communications, 11(1):3514, July 2020. doi:10.1038/s41467-020-17041-7.

[compLRM17] Aaron T. L. Lun, Arianne C. Richard, and John C. Marioni. Testing for differential abundance in mass cytometry data. Nature Methods, 14(7):707–709, July 2017. doi:10.1038/nmeth.4295.

[compMGC21] Leonardo Morelli, Valentina Giansanti, and Davide Cittaro. Nested stochastic block models applied to the analysis of single cell data. BMC Bioinformatics, 22(1):576, November 2021. doi:10.1186/s12859-021-04489-7.

[compOCM21] Johannes Ostner, Salomé Carcy, and Christian L. Müller. tascCODA: Bayesian tree-aggregated analysis of compositional amplicon and single-cell data. Frontiers in Genetics, 2021. doi:10.3389/fgene.2021.766405.

[compRO10] Mark D. Robinson and Alicia Oshlack. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology, 11(3):R25, March 2010. doi:10.1186/gb-2010-11-3-r25.

[compSSH+22] Stefan Salcher, Gregor Sturm, Lena Horvath, Gerold Untergasser, Georgios Fotakis, Elisa Panizzolo, Agnieszka Martowicz, Georg Pall, Gabriele Gamerith, Martina Sykora, Florian Augustin, Katja Schmitz, Francesca Finotello, Dietmar Rieder, Sieghart Sopper, Dominik Wolf, Andreas Pircher, and Zlatko Trajanoski. High-resolution single-cell atlas reveals diversity and plasticity of tissue-resident neutrophils in non-small cell lung cancer. bioRxiv, 2022. doi:10.1101/2022.05.09.491204.

[compZJL+21] Jun Zhao, Ariel Jaffe, Henry Li, Ofir Lindenbaum, Esen Sefik, Ruaidhrí Jackson, Xiuyuan Cheng, Richard A. Flavell, and Yuval Kluger. Detection of differentially abundant cell subpopulations in scRNA-seq data. Proceedings of the National Academy of Sciences, 118(22):e2100293118, 2021. doi:10.1073/pnas.2100293118.

17.10. Contributors#

We gratefully acknowledge the contributions of:

17.10.1. Authors#

Johannes Ostner, Emma Dann, Lukas Heumos, Anastasia Litinetskaya

17.10.2. Reviewers#
```python
import warnings

warnings.filterwarnings("ignore")
warnings.simplefilter("ignore")

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pertpy as pt
import scanpy as sc
import seaborn as sns
```
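Note that the dummy-confounder snippet above draws random group labels from a generator called `rng`, which is never created in the extracts shown here. A minimal sketch, assuming a standard NumPy generator is intended (the seed is illustrative):

```python
import numpy as np

# Random generator used by rng.choice(...) in the confounder example above
rng = np.random.default_rng(0)
```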
Pattern 3: Because of the sum-to-one constraint on cell type proportions, a negative correlation between cell type abundances is induced: if one cell type expands, the proportions of all others must shrink. To illustrate this, let's consider the following example:
```python
healthy_tissue = [2000, 2000, 2000]
diseased_tissue = [4000, 2000, 2000]
example_data_global = pd.DataFrame(
    data=np.array([healthy_tissue, diseased_tissue]),
    index=[1, 2],
    columns=["A", "B", "C"],
)
example_data_global["Disease status"] = ["Healthy", "Diseased"]
example_data_global
```
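Converting these counts to proportions makes the induced negative correlation visible: cell type A doubles in the diseased sample, so the proportions of B and C shrink even though their absolute counts are unchanged. A minimal sketch continuing from the DataFrame above:

```python
# Divide each row by its total cell count to obtain proportions per sample
proportions = example_data_global[["A", "B", "C"]].div(
    example_data_global[["A", "B", "C"]].sum(axis=1), axis=0
)
print(proportions)  # A rises to 0.5 while B and C each drop from 1/3 to 0.25
```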
Pattern 4: 5. Interoperability

Summary
- Interoperability between languages allows analysts to take advantage of the strengths of different ecosystems
- On-disk interoperability uses standard file formats to transfer data and is typically more reliable
- In-memory interoperability transfers data directly between parallel sessions and is convenient for interactive analysis
- While interoperability is currently possible, developers continue to improve the experience

5.1. Motivation

As we have discussed in the analysis frameworks and tools chapter, there are three main ecosystems for single-cell analysis: the Bioconductor and Seurat ecosystems in R and the Python-based scverse ecosystem. A common question from new analysts is which ecosystem they should focus on learning and using. While it makes sense to focus on one to start with, and a successful standard analysis can be performed in any ecosystem, we promote the idea that competent analysts should be familiar with all three ecosystems and comfortable moving between them. This approach allows analysts to use the best-performing tools and methods regardless of how they were implemented.
When analysts are not comfortable moving between ecosystems, they tend to use packages that are easy to access, even when those packages have been shown to have shortcomings compared to packages in another ecosystem. The ability of analysts to move between ecosystems also allows developers to take advantage of the different strengths of programming languages. For example, R has strong built-in support for complex statistical modelling, while the majority of deep learning libraries are focused on Python. By supporting common on-disk data formats and in-memory data structures, developers can be confident that analysts can access their package and can use the platform that is most appropriate for their method. Another motivation for being comfortable with multiple ecosystems is the accessibility and availability of data, results and documentation. Often data or results are only made available in one format, and analysts need to be familiar with that format in order to access them. A basic understanding of other ecosystems is also necessary to understand package documentation and tutorials when deciding which methods to use. While we encourage analysts to be comfortable with all the major ecosystems, moving between them is only possible when they are interoperable. Thankfully, a lot of work has been done in this area and it is now relatively simple in most cases using standard packages. In this chapter, we discuss the various ways data can be moved between ecosystems via disk or in memory, the differences between them and their advantages. We focus on single-modality data and moving between R and Python, as these are the most common cases, but we also touch on multimodal data and other languages.

5.2. Nomenclature

Because talking about different languages can get confusing, we try to use the following conventions:

- {package} - An R package
- package::function() - A function in an R package
- package - A Python package
- package.function() - A function in a Python package
- Emphasised - Some other important concept
- code - Other parts of code including objects, variables etc. This is also used for files or directories.

```python
import tempfile
from pathlib import Path

import anndata
import anndata2ri
import mudata
import numpy
import rpy2.robjects
import scanpy
from scipy.sparse import csr_matrix

anndata2ri.activate()
%load_ext rpy2.ipython
```

5.3. Disk-based interoperability

The first approach to moving between languages is disk-based interoperability. This involves writing a file to disk in one language and then reading that file into a second language. In many cases, this approach is simpler, more reliable and more scalable than in-memory interoperability (which we discuss below), but it comes at the cost of greater storage requirements and reduced interactivity. Disk-based interoperability tends to work particularly well when there are established processes for each stage of analysis and you want to pass objects from one to the next (especially as part of a pipeline developed using a workflow manager such as Nextflow or Snakemake). However, disk-based interoperability is less convenient for interactive steps such as data exploration or experimenting with methods, as you need to write a new file whenever you want to move between languages.

5.3.1. Simple formats

Before discussing file formats specifically developed for single-cell data, we want to briefly mention that common simple text file formats (such as CSV, TSV, JSON etc.) can often be the answer to transferring data between languages. They work well in cases where some analysis has been performed and what you want to transfer is a subset of the information about an experiment. For example, you may want to transfer only the cell metadata and not the feature metadata, expression matrices etc. The advantage of using simple text formats is that they are well supported by almost any language and do not require single-cell specific packages. However, they can quickly become impractical as what you want to transfer becomes more complex.
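As a minimal sketch of the simple-format approach, the cell metadata of an AnnData object is a pandas DataFrame and can be written to CSV for any other language to pick up (the object and file name here are illustrative):

```python
import anndata
import numpy
import pandas

# A toy object standing in for a real dataset
adata_toy = anndata.AnnData(
    X=numpy.zeros((3, 2)),
    obs=pandas.DataFrame({"cell_type": ["B", "T", "NK"]}, index=["c1", "c2", "c3"]),
)

# Write only the cell metadata; in R this could be read back with
# read.csv("cell_metadata.csv", row.names = 1)
adata_toy.obs.to_csv("cell_metadata.csv")
```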
5.3.2. HDF5-based formats

The most common disk formats for single-cell data are based on Hierarchical Data Format version 5, or HDF5. This is an open-source file format designed for storing large, complex and heterogeneous data. It has a file-directory-like structure (similar to how files and folders are organised on your computer) which allows many different kinds of data to be stored in a single file with an arbitrarily complex hierarchy. While this format is very flexible, to properly interact with it you need to know where and how the different information is stored. For this reason, standard specifications for storing single-cell data in HDF5 files have been developed.

5.3.2.1. H5AD

The H5AD format is the HDF5 disk representation of the AnnData object used by scverse packages and is commonly used to share single-cell datasets. As it is part of the scverse ecosystem, reading and writing these files from Python is well supported and is part of the core functionality of the anndata package (read more about the format here). To demonstrate interoperability we will use a small, randomly generated dataset that has gone through some of the steps of a standard analysis workflow to populate the various slots.

```python
# Create a randomly generated AnnData object to use as an example
counts = csr_matrix(
    numpy.random.default_rng().poisson(1, size=(100, 2000)), dtype=numpy.float32
)
adata = anndata.AnnData(counts)
adata.obs_names = [f"Cell_{i:d}" for i in range(adata.n_obs)]
adata.var_names = [f"Gene_{i:d}" for i in range(adata.n_vars)]

# Do some standard processing to populate the object
scanpy.pp.calculate_qc_metrics(adata, inplace=True)
adata.layers["counts"] = adata.X.copy()
scanpy.pp.normalize_total(adata, inplace=True)
scanpy.pp.log1p(adata)
scanpy.pp.highly_variable_genes(adata, inplace=True)
scanpy.tl.pca(adata)
scanpy.pp.neighbors(adata)
scanpy.tl.umap(adata)
adata
```

```
AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'
    var: 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'hvg', 'pca', 'neighbors', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'counts'
    obsp: 'distances', 'connectivities'
```

We will write this mock object to disk as a H5AD file to demonstrate how those files can be read from R.

```python
temp_dir = tempfile.TemporaryDirectory()
h5ad_file = Path(temp_dir.name) / "example.h5ad"
adata.write_h5ad(h5ad_file)
```

Several packages exist for reading and writing H5AD files from R. While they result in a file on disk, these packages usually rely on wrapping the Python anndata package to handle the actual reading and writing of files, with an in-memory conversion step to convert between R and Python.
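Because H5AD is plain HDF5, the hierarchy described above can also be inspected directly. A minimal sketch using h5py (an extra dependency, not required by any of the packages discussed here):

```python
import h5py

# List the top-level HDF5 groups (X, layers, obs, var, obsm, uns, ...)
# without loading any of the data into memory
with h5py.File(h5ad_file, "r") as f:
    print(sorted(f.keys()))
```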
5.3.2.1.1. Reading/writing H5AD with Bioconductor

The Bioconductor {zellkonverter} package helps make this easier by using the {basilisk} package to manage the creation of an appropriate Python environment. If that all sounds a bit technical, the end result is that Bioconductor users can read and write H5AD files using commands like those below without requiring any knowledge of Python. Unfortunately, because of the way this book is made, we are unable to run the code directly here. Instead, we show the code and what the output looks like when run in an R session:

```r
sce <- zellkonverter::readH5AD(h5ad_file, verbose = TRUE)
```

```
ℹ Using the Python reader
ℹ Using anndata version 0.8.0
✔ Read /.../luke.zappia/Downloads/example.h5ad [113ms]
✔ uns$hvg$flavor converted [17ms]
✔ uns$hvg converted [50ms]
✔ uns$log1p converted [25ms]
✔ uns$neighbors converted [18ms]
✔ uns$pca$params$use_highly_variable converted [16ms]
✔ uns$pca$params$zero_center converted [16ms]
✔ uns$pca$params converted [80ms]
✔ uns$pca$variance converted [17ms]
✔ uns$pca$variance_ratio converted [16ms]
✔ uns$pca converted [184ms]
✔ uns$umap$params$a converted [16ms]
✔ uns$umap$params$b converted [16ms]
✔ uns$umap$params converted [80ms]
✔ uns$umap converted [112ms]
✔ uns converted [490ms]
✔ Converting uns to metadata ... done
✔ X matrix converted to assay [29ms]
✔ layers$counts converted [27ms]
✔ Converting layers to assays ... done
✔ var converted to rowData [25ms]
✔ obs converted to colData [24ms]
✔ varm$PCs converted [18ms]
✔ varm converted [47ms]
✔ Converting varm to rowData$varm ... done
✔ obsm$X_pca converted [15ms]
✔ obsm$X_umap converted [16ms]
✔ obsm converted [80ms]
✔ Converting obsm to reducedDims ... done
ℹ varp is empty and was skipped
✔ obsp$connectivities converted [22ms]
✔ obsp$distances converted [23ms]
✔ obsp converted [92ms]
✔ Converting obsp to colPairs ... done
✔ SingleCellExperiment constructed [164ms]
ℹ Skipping conversion of raw
✔ Converting AnnData to SingleCellExperiment ... done
```

Because we have turned on the verbose output, you can see how {zellkonverter} reads the file using Python and converts each part of the AnnData object to a Bioconductor SingleCellExperiment object. We can see what the result looks like:

```r
sce
```

```
class: SingleCellExperiment 
dim: 2000 100 
metadata(5): hvg log1p neighbors pca umap
assays(2): X counts
rownames(2000): Gene_0 Gene_1 ... Gene_1998 Gene_1999
rowData names(11): n_cells_by_counts mean_counts ... dispersions_norm varm
colnames(100): Cell_0 Cell_1 ... Cell_98 Cell_99
colData names(8): n_genes_by_counts log1p_n_genes_by_counts ...
  pct_counts_in_top_200_genes pct_counts_in_top_500_genes
reducedDimNames(2): X_pca X_umap
mainExpName: NULL
altExpNames(0):
```

This object can then be used as normal by any Bioconductor package. If we want to write a new H5AD file we can use the writeH5AD() function:

```r
zellkonverter_h5ad_file <- tempfile(fileext = ".h5ad")
zellkonverter::writeH5AD(sce, zellkonverter_h5ad_file, verbose = TRUE)
```

```
ℹ Using anndata version 0.8.0
ℹ Using the 'X' assay as the X matrix
✔ Selected X matrix [29ms]
✔ assays$X converted to X matrix [50ms]
✔ additional assays converted to layers [30ms]
✔ rowData$varm converted to varm [28ms]
✔ reducedDims converted to obsm [68ms]
✔ metadata converted to uns [24ms]
ℹ rowPairs is empty and was skipped
✔ Converting SingleCellExperiment to AnnData ... done
✔ Wrote '/.../.../rj/.../T/.../file102cfa97cc51.h5ad' [133ms]
```
We can then read this file in Python:

```python
scanpy.read_h5ad(zellkonverter_h5ad_file)
```

```
AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'
    var: 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'X_name', 'hvg', 'log1p', 'neighbors', 'pca', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'counts'
    obsp: 'connectivities', 'distances'
```

If this is the first time you have run a {zellkonverter} function, you will see that it first creates a special conda environment to use (which can take a while). Once that environment exists, it will be re-used by subsequent function calls. {zellkonverter} has additional options, such as allowing you to selectively read or write parts of an object. Please refer to the package documentation for more details. Similar functionality for writing a SingleCellExperiment object to an H5AD file can be found in the {sceasy} package. While these packages are effective, wrapping Python comes with some overhead, which should be addressed by native R H5AD readers/writers in the future.
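For larger files, it can also be useful to know that anndata can open H5AD files lazily. A minimal sketch, assuming the file written above:

```python
import anndata

# Backed mode keeps the expression matrix on disk; obs and var are
# loaded into memory as usual.
adata_backed = anndata.read_h5ad(zellkonverter_h5ad_file, backed="r")
print(adata_backed.isbacked)
```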
5.3.2.1.2. Reading/writing H5AD with {Seurat}

Converting between a Seurat object and an H5AD file is a two-step process, as suggested by this tutorial. First the H5AD file is converted to an H5Seurat file (a custom HDF5 format for Seurat objects) using the {SeuratDisk} package, and then this file is read as a Seurat object.

```r
%%R -i h5ad_file
message("Converting H5AD to H5Seurat...")
SeuratDisk::Convert(h5ad_file, dest = "h5seurat", overwrite = TRUE)
message("Reading H5Seurat...")
h5seurat_file <- gsub(".h5ad", ".h5seurat", h5ad_file)
seurat <- SeuratDisk::LoadH5Seurat(h5seurat_file, assays = "RNA")
message("Read Seurat object:")
seurat
```

```
Converting H5AD to H5Seurat...
Warning: Unknown file type: h5ad
Warning: 'assay' not set, setting to 'RNA'
Creating h5Seurat file for version 3.1.5.9900
Adding X as data
Adding X as counts
Adding meta.features from var
Adding X_pca as cell embeddings for pca
Adding X_umap as cell embeddings for umap
Adding PCs as feature loadings for pca
Adding miscellaneous information for pca
Adding standard deviations for pca
Adding miscellaneous information for umap
Adding hvg to miscellaneous data
Adding log1p to miscellaneous data
Adding layer counts as data in assay counts
Adding layer counts as counts in assay counts
Reading H5Seurat...
Validating h5Seurat file
Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')
Initializing RNA with data
Adding counts for RNA
Adding feature-level metadata for RNA
Adding reduction pca
Adding cell embeddings for pca
Adding feature loadings for pca
Adding miscellaneous information for pca
Adding reduction umap
Adding cell embeddings for umap
Adding miscellaneous information for umap
Adding command information
Adding cell-level metadata
Read Seurat object:
An object of class Seurat 
2000 features across 100 samples within 1 assay 
Active assay: RNA (2000 features, 0 variable features)
 2 dimensional reductions calculated: pca, umap
```

Note that because the structure of a Seurat object is quite different from AnnData and SingleCellExperiment objects, the conversion process is more complex. See the documentation of the conversion function for more details on how this is done. The {sceasy} package also provides a function for reading H5AD files as Seurat or SingleCellExperiment objects in a single step. {sceasy} also wraps Python functions, but unlike {zellkonverter} it doesn't use a special Python environment. This means you are responsible for setting up the environment, making sure that R can find it and that the correct packages are installed (again, this code is not run here).

```r
sceasy_seurat <- sceasy::convertFormat(h5ad_file, from = "anndata", to = "seurat")
sceasy_seurat
```

```
Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')
X -> counts
An object of class Seurat 
2000 features across 100 samples within 1 assay 
Active assay: RNA (2000 features, 0 variable features)
 2 dimensional reductions calculated: pca, umap
```

5.3.2.1.3. Reading/writing H5AD with {anndata}

The R {anndata} package can also be used to read H5AD files. However, unlike the packages above, it does not convert to a native R object. Instead it provides an R interface to the Python object. This is useful for accessing the data, but few analysis packages will accept this as input, so further in-memory conversion is usually required.
5.3.2.2. Loom

The Loom file format is an older HDF5 specification for omics data. Unlike H5AD, it is not linked to a specific analysis ecosystem, although the structure is similar to AnnData and SingleCellExperiment objects. Packages implementing the Loom format exist for both R and Python, as well as a Bioconductor package for writing Loom files. However, it is often more convenient to use the higher-level interfaces provided by the core ecosystem packages. Apart from sharing datasets, another common place Loom files are encountered is when spliced/unspliced reads are quantified using velocyto for RNA velocity analysis.

5.3.3. RDS files

Another file format you may see used to share single-cell datasets is the RDS format. This is a binary format used to serialise arbitrary R objects (similar to Python pickle files). As SingleCellExperiment and Seurat objects did not always have matching on-disk representations, RDS files are sometimes used to share the results of R analyses. While this is fine within an analysis project, we discourage its use for sharing data publicly or with collaborators because of the lack of interoperability with other ecosystems. Instead, we recommend using one of the HDF5 formats mentioned above that can be read from multiple languages.

5.3.4. New on-disk formats

While HDF5-based formats are currently the standard for on-disk representations of single-cell data, other newer technologies such as Zarr and TileDB have some advantages, particularly for very large datasets and other modalities. We expect specifications to be developed for these formats in the future, which may be adopted by the community (anndata already provides support for Zarr files).

5.4. In-memory interoperability

The second approach to interoperability is to work on in-memory representations of an object. This approach involves active sessions of two programming languages running at the same time, either accessing the same object from both or converting between them as needed. Usually, one language acts as the main environment and there is an interface to the other language. This can be very useful for interactive analysis as it allows an analyst to work in two languages simultaneously. It is also often used when creating documents that use multiple languages (such as this book). However, in-memory interoperability has some drawbacks: it requires the analyst to be familiar with setting up and using both environments, more complex objects are often not supported by both languages, and there is a greater memory overhead as data can easily become duplicated (making it difficult to use for larger datasets).

5.4.1. Interoperability between R ecosystems

Before we look at in-memory interoperability between R and Python, let's first consider the simpler case of converting between the two R ecosystems. The {Seurat} package provides functions for performing this conversion, as described in this vignette.

```r
%%R
sce_from_seurat <- Seurat::as.SingleCellExperiment(seurat)
sce_from_seurat
```

```
class: SingleCellExperiment 
dim: 2000 100 
metadata(0):
assays(2): X logcounts
rownames(2000): Gene-0 Gene-1 ... Gene-1998 Gene-1999
rowData names(0):
colnames(100): Cell_0 Cell_1 ... Cell_98 Cell_99
colData names(9): n_genes_by_counts log1p_n_genes_by_counts ...
  pct_counts_in_top_500_genes ident
reducedDimNames(2): PCA UMAP
mainExpName: NULL
altExpNames(0):
```
```r
%%R
seurat_from_sce <- Seurat::as.Seurat(sce_from_seurat)
seurat_from_sce
```

```
An object of class Seurat 
2000 features across 100 samples within 1 assay 
Active assay: RNA (2000 features, 0 variable features)
 2 dimensional reductions calculated: PCA, UMAP
```

The difficult part here is due to the differences between the structures of the two objects. It is important to make sure the arguments are set correctly so that the conversion functions know which information to convert and where to place it. In many cases it may not be necessary to convert a Seurat object to a SingleCellExperiment at all, because many of the core Bioconductor packages for single-cell analysis have been designed to also accept a matrix as input.

```r
%%R
# Calculate Counts Per Million using the Bioconductor scuttle package
# with a matrix in a Seurat object
cpm <- scuttle::calculateCPM(Seurat::GetAssayData(seurat, slot = "counts"))
cpm[1:10, 1:10]
```

```
10 x 10 sparse Matrix of class "dgCMatrix"
 [1,] 602.1263 622.1264  600.5326 1416.439    .         .       965.8600
 [2,] 602.1263   .         .       613.435 618.2562  943.8910  609.1107
 [3,] 602.1263 982.8946 1202.6879  969.506 618.2562  943.8910 1219.1005
 [4,]   .        .       600.5326  613.435 976.3175  594.3451    .
 [5,] 602.1263 622.1264    .      1221.384 618.2562  594.3451  609.1107
 [6,] 954.8394 982.8946 1202.6879  613.435 618.2562  943.8910 1219.1005
 [7,] 954.8394 622.1264  952.6379  613.435 618.2562    .       965.8600
 [8,]   .      982.8946    .         .     618.2562  594.3451    .
 [9,] 954.8394   .       600.5326  613.435   .      1192.4227 1219.1005
[10,] 954.8394 622.1264  952.6379  613.435 618.2562  943.8910  609.1107
                                  
 [1,]  958.4081 599.5090  610.1969
 [2,]  958.4081 952.2526  610.1969
 [3,]    .        .       965.9120
 [4,]    .        .         .
 [5,]  605.0013 952.2526    .
 [6,]    .        .         .
 [7,]  605.0013 599.5090  965.9120
 [8,] 1209.0169   .       965.9120
 [9,]    .      599.5090 1413.3168
[10,] 1209.0169   .       610.1969
```

However, it is important to be sure you are accessing the right information and storing any results in the correct place if needed.

5.4.2. Accessing R from Python

The Python interface to R is provided by the rpy2 package. This allows you to access R functions and objects from Python. For example:

```python
counts_mat = adata.layers["counts"].T
rpy2.robjects.globalenv["counts_mat"] = counts_mat
cpm = rpy2.robjects.r("scuttle::calculateCPM(counts_mat)")
cpm
```

```
<2000x100 sparse matrix of type '<class 'numpy.float64'>'
    with 126146 stored elements in Compressed Sparse Column format>
```

Common Python objects (lists, matrices, DataFrames etc.) can also be passed to R. If you are using a Jupyter notebook (as we are for this book), you can use the IPython magic interface to create cells with native R code (passing objects as required). For example, starting a cell with `%%R -i input -o output` says to take `input` as input, run R code and then return `output` as output.

```r
%%R -i counts_mat -o magic_cpm
# R code running using IPython magic
magic_cpm <- scuttle::calculateCPM(counts_mat)
```

```python
# Python code accessing the results
magic_cpm
```

```
<2000x100 sparse matrix of type '<class 'numpy.float64'>'
    with 126146 stored elements in Compressed Sparse Column format>
```

This is the approach you will most commonly see in later chapters. For more information about using rpy2, please refer to the documentation. To work with single-cell data in this way, the anndata2ri package is especially useful. This is an extension to rpy2 which allows R to see AnnData objects as SingleCellExperiment objects. This avoids unnecessary conversion and makes it easy to run R code on a Python object. It also enables the conversion of sparse scipy matrices, as we saw above.
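Before the AnnData example, here is a minimal sketch of explicitly converting a plain pandas DataFrame with rpy2's own pandas converter (independent of anndata2ri; the DataFrame contents are illustrative):

```python
import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

df = pd.DataFrame(
    {"total_counts": [1965, 2006, 1958]}, index=["Cell_0", "Cell_1", "Cell_2"]
)

# Convert inside a local converter context rather than activating globally
with localconverter(ro.default_converter + pandas2ri.converter):
    r_df = ro.conversion.py2rpy(df)

print(ro.r("class")(r_df))  # "data.frame" on the R side
```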
In this example, we pass an AnnData object in the Python session to R, which views it as a SingleCellExperiment that can be used by R functions.

```r
%%R -i adata
qc <- scuttle::perCellQCMetrics(adata)
head(qc)
```

```
/Users/luke.zappia/miniconda3/envs/interoperability2/lib/python3.9/functools.py:888: NotConvertedWarning: Conversion 'py2rpy' not defined for objects of type '<class 'NoneType'>'
  return dispatch(args[0].__class__)(*args, **kw)
        sum detected total
Cell_0 2005     1274  2005
Cell_1 1941     1233  1941
Cell_2 2011     1270  2011
Cell_3 1947     1268  1947
Cell_4 1933     1265  1933
Cell_5 2031     1289  2031
```

Note that you will still run into issues if an object (or part of it) cannot be interfaced correctly (for example, if there is an unsupported data type). In that case, you may need to modify your object before it can be accessed.

5.4.3. Accessing Python from R

Accessing Python from an R session is similar to accessing R from Python, but here the interface is provided by the {reticulate} package. Once it is loaded we can access Python functions and objects from R.

```r
%%R
reticulate_list <- reticulate::r_to_py(LETTERS)
print(reticulate_list)
py_builtins <- reticulate::import_builtins()
py_builtins$zip(letters, LETTERS)
```

```
List (26 items)
<zip object at 0x18a243040>
```

If you are working in an RMarkdown or Quarto document, you can also write native Python chunks using the {reticulate} Python engine. When we do this we can use the magic `r` and `py` variables to access objects in the other language (the following code is an example that is not run).

```r
# An R chunk that accesses a Python object
print(py$py_object)
```

```python
# A Python chunk that accesses an R object
print(r.r_object)
```

Unlike anndata2ri, there are no R packages that provide a direct interface allowing Python to view SingleCellExperiment or Seurat objects as AnnData objects. However, we can still access most parts of an AnnData object using {reticulate} (this code is not run).
```r
# Print an AnnData object in a Python environment
py$adata
```

```
AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'
    var: 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'hvg', 'log1p', 'neighbors', 'pca', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'counts'
    obsp: 'connectivities', 'distances'
```

```r
# Alternatively use the Python anndata package to read a H5AD file
anndata <- reticulate::import("anndata")
anndata$read_h5ad(h5ad_file)
```

```
AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'
    var: 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'hvg', 'log1p', 'neighbors', 'pca', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'counts'
    obsp: 'connectivities', 'distances'
```

```r
# Access the obs slot; pandas DataFrames are automatically converted to R data.frames
head(adata$obs)
```

```
       n_genes_by_counts log1p_n_genes_by_counts total_counts
Cell_0              1246                7.128496         1965
Cell_1              1262                7.141245         2006
Cell_2              1262                7.141245         1958
Cell_3              1240                7.123673         1960
Cell_4              1296                7.167809         2027
Cell_5              1231                7.116394         1898
       log1p_total_counts pct_counts_in_top_50_genes
Cell_0           7.583756                  10.025445
Cell_1           7.604396                   9.521436
Cell_2           7.580189                   9.959142
Cell_3           7.581210                   9.183673
Cell_4           7.614805                   9.718796
Cell_5           7.549083                  10.168599
       pct_counts_in_top_100_genes pct_counts_in_top_200_genes
Cell_0                    17.65903                    30.89059
Cell_1                    16.99900                    29.71087
Cell_2                    17.62002                    30.28601
Cell_3                    16.83673                    30.45918
Cell_4                    17.11889                    30.04440
Cell_5                    18.07165                    30.29505
       pct_counts_in_top_500_genes
Cell_0                    61.42494
Cell_1                    59.62114
Cell_2                    60.92952
Cell_3                    61.07143
Cell_4                    59.64480
Cell_5                    61.48577
```

As mentioned above, the R {anndata} package provides an R interface to AnnData objects, but it is not currently used by many analysis packages. For more complex analysis that requires a whole object to work on, it may be necessary to completely convert an object from R to Python (or the reverse). This is not memory efficient as it creates a duplicate of the data, but it does provide access to a greater range of packages. The {zellkonverter} package provides a function for doing this conversion (note that, unlike the function for reading H5AD files, this uses the normal Python environment rather than a specially created one) (code not run).

```r
# Convert an AnnData to a SingleCellExperiment
sce <- zellkonverter::AnnData2SCE(adata, verbose = TRUE)
sce
```
```
✔ SingleCellExperiment constructed [241ms]
ℹ Skipping conversion of raw
✔ Converting AnnData to SingleCellExperiment ... done
class: SingleCellExperiment 
dim: 2000 100 
metadata(5): hvg log1p neighbors pca umap
assays(2): X counts
rownames(2000): Gene_0 Gene_1 ... Gene_1998 Gene_1999
rowData names(11): n_cells_by_counts mean_counts ... dispersions_norm varm
colnames(100): Cell_0 Cell_1 ... Cell_98 Cell_99
colData names(8): n_genes_by_counts log1p_n_genes_by_counts ...
  pct_counts_in_top_200_genes pct_counts_in_top_500_genes
reducedDimNames(2): X_pca X_umap
mainExpName: NULL
altExpNames(0):
```

The same can also be done in reverse:

```r
adata2 <- zellkonverter::SCE2AnnData(sce, verbose = TRUE)
adata2
```

```
ℹ Using the 'X' assay as the X matrix
✔ Selected X matrix [27ms]
✔ assays$X converted to X matrix [38ms]
✔ additional assays converted to layers [31ms]
✔ rowData$varm converted to varm [15ms]
✔ reducedDims converted to obsm [63ms]
✔ metadata converted to uns [23ms]
ℹ rowPairs is empty and was skipped
✔ Converting SingleCellExperiment to AnnData ... done
AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'
    var: 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'X_name', 'hvg', 'log1p', 'neighbors', 'pca', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'counts'
    obsp: 'connectivities', 'distances'
```

5.5. Interoperability for multimodal data

The complexity of multimodal data presents additional challenges for interoperability. Both the SingleCellExperiment object (through "alternative experiments", which must match in the column (cell) dimension) and the Seurat object (using "assays") can store multiple modalities, but the AnnData object is restricted to unimodal data. To address this limitation, the MuData object (introduced in the analysis frameworks and tools chapter) was developed as an extension of AnnData for multimodal datasets. The developers have considered interoperability in their design. While the main platform for MuData is Python, the authors provide the MuDataSeurat R package for reading the on-disk H5MU format as Seurat objects and the MuData R package for doing the same with Bioconductor MultiAssayExperiment objects. This official support is very useful, but there are still some inconsistencies due to differences between the objects. The MuData authors also provide a Julia implementation of AnnData and MuData.
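As a minimal sketch of what the MuData container looks like from Python, two AnnData objects sharing the same cells can be wrapped into one multimodal object (toy shapes and modality names, for illustration only):

```python
import anndata
import mudata
import numpy

# Two toy modalities measured on the same three cells
rna = anndata.AnnData(numpy.zeros((3, 10), dtype=numpy.float32))
prot = anndata.AnnData(numpy.zeros((3, 4), dtype=numpy.float32))
rna.obs_names = prot.obs_names = ["c1", "c2", "c3"]

mdata_toy = mudata.MuData({"rna": rna, "prot": prot})
print(mdata_toy)  # reports n_obs × n_vars = 3 × 14 with two modalities
```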
Below is an example of reading and writing a small example MuData dataset using the Python and R packages.

5.5.1. Python

```python
# Read file
mdata = mudata.read_h5mu("../../datasets/original.h5mu")
print(mdata)

# Write new file
python_h5mu_file = Path(temp_dir.name) / "python.h5mu"
mdata.write_h5mu(python_h5mu_file)
```

```
MuData object with n_obs × n_vars = 411 × 56
  obs: 'louvain', 'leiden', 'leiden_wnn', 'celltype'
  var: 'gene_ids', 'feature_types', 'highly_variable'
  obsm: 'X_mofa', 'X_mofa_umap', 'X_umap', 'X_wnn_umap'
  varm: 'LFs'
  obsp: 'connectivities', 'distances', 'wnn_connectivities', 'wnn_distances'
  2 modalities
    prot: 411 x 29
      var: 'gene_ids', 'feature_types', 'highly_variable'
      uns: 'neighbors', 'pca', 'umap'
      obsm: 'X_pca', 'X_umap'
      varm: 'PCs'
      layers: 'counts'
      obsp: 'connectivities', 'distances'
    rna: 411 x 27
      obs: 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden', 'celltype'
      var: 'gene_ids', 'feature_types', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
      uns: 'celltype_colors', 'hvg', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
      obsm: 'X_pca', 'X_umap'
      varm: 'PCs'
      obsp: 'connectivities', 'distances'
```

5.5.2. R

5.5.2.1. Bioconductor

Read/write from/to a MultiAssayExperiment object:

```r
%%R
mae <- MuData::readH5MU("../../datasets/original.h5mu")
print(mae)
bioc_h5mu_file <- tempfile(fileext = ".h5mu")
MuData::writeH5MU(mae, bioc_h5mu_file)
```

```
A MultiAssayExperiment object of 2 listed
 experiments with user-defined names and respective classes.
 Containing an ExperimentList class object of length 2:
 [1] prot: SingleCellExperiment with 29 rows and 411 columns
 [2] rna: SingleCellExperiment with 27 rows and 411 columns
Functionality:
 experiments() - obtain the ExperimentList instance
 colData() - the primary/phenotype DataFrame
 sampleMap() - the sample coordination DataFrame
 `$`, `[`, `[[` - extract colData columns, subset, or experiment
 *Format() - convert into a long or wide DataFrame
 assays() - convert ExperimentList to a SimpleList of matrices
 exportClass() - save data to flat files
```

5.5.2.2. Seurat

Read/write from/to a Seurat object:

```r
%%R
seurat <- MuDataSeurat::ReadH5MU("../../datasets/original.h5mu")
print(seurat)
seurat_h5mu_file <- tempfile(fileext = ".h5mu")
MuDataSeurat::WriteH5MU(seurat, seurat_h5mu_file)
```
```
Warning: Keys should be one or more alphanumeric characters followed by an underscore, setting key from prot to prot_
Warning: No columnames present in cell embeddings, setting to 'MOFA_1:30'
Warning: Keys should be one or more alphanumeric characters followed by an underscore, setting key from MOFA_UMAP_ to MOFAUMAP_
Warning: No columnames present in cell embeddings, setting to 'MOFAUMAP_1:2'
Warning: No columnames present in cell embeddings, setting to 'UMAP_1:2'
Warning: Keys should be one or more alphanumeric characters followed by an underscore, setting key from WNN_UMAP_ to WNNUMAP_
Warning: No columnames present in cell embeddings, setting to 'WNNUMAP_1:2'
Warning: No columnames present in cell embeddings, setting to 'protPCA_1:31'
Warning: No columnames present in cell embeddings, setting to 'protUMAP_1:2'
Warning: No columnames present in cell embeddings, setting to 'rnaPCA_1:50'
Warning: No columnames present in cell embeddings, setting to 'rnaUMAP_1:2'
An object of class Seurat 
56 features across 411 samples within 2 assays 
Active assay: prot (29 features, 29 variable features)
 1 other assay present: rna
 8 dimensional reductions calculated: MOFA, MOFA_UMAP, UMAP, WNN_UMAP, protPCA, protUMAP, rnaPCA, rnaUMAP
Added .var['highly_variable'] with highly variable features
Added .var['highly_variable'] with highly variable features
```

5.6. Interoperability with other languages

Here we briefly list some resources and tools for the interoperability of single-cell data with languages other than R and Python.

5.6.1. Julia

- Muon.jl provides Julia implementations of AnnData and MuData objects, as well as IO for the H5AD and H5MU formats
- scVI.jl provides a Julia implementation of AnnData as well as IO for the H5AD format

5.6.2. JavaScript

- Vitessce contains loaders for AnnData objects stored using the Zarr format
- The kana family supports reading H5AD files and SingleCellExperiment objects saved as RDS files

5.6.3. Rust

- anndata-rs provides a Rust implementation of AnnData as well as advanced IO support for the H5AD format

5.7. Session information
5.8. Python

```python
import session_info

session_info.show()
```

```
-----
anndata             0.9.2
anndata2ri          1.2
mudata              0.2.3
numpy               1.24.4
rpy2                3.5.11
scanpy              1.9.3
scipy               1.9.3
session_info        1.0.0
-----
Python 3.9.16 | packaged by conda-forge | (main, Feb  1 2023, 21:42:20) [Clang 14.0.6 ]
macOS-13.4.1-x86_64-i386-64bit
-----
Session information updated at 2023-11-15 15:48
```
5.9. R

```r
%%R
sessioninfo::session_info()
```

```
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.2.3 (2023-03-15)
 os       macOS Ventura 13.4.1
 system   x86_64, darwin13.4.0
 ui       unknown
 language (EN)
 collate  C
 ctype    UTF-8
 tz       Europe/Berlin
 date     2023-11-15
 pandoc   2.19.2 @ /Users/luke.zappia/miniconda3/envs/interoperability2/bin/pandoc
──────────────────────────────────────────────────────────────────────────────
```

5.11. Contributors

We gratefully acknowledge the contributions of:

5.11.1. Authors

- Luke Zappia

5.11.2. Reviewers

- Lukas Heumos
- Isaac Virshup
- Anastasia Litinetskaya
- Ludwig Geistlinger
- Peter Hickey
Pattern 5: The Python interface to R is provided by the rpy2 package. This allows you to access R functions and objects from Python. For example:
```python
counts_mat = adata.layers["counts"].T
rpy2.robjects.globalenv["counts_mat"] = counts_mat
cpm = rpy2.robjects.r("scuttle::calculateCPM(counts_mat)")
cpm
```
Example Code Patterns
Example 1 (python):
```python
sc.logging.print_versions()
```
Example 2 (python):
path_data = "data"
path_tcr = f"{path_data}/TCR_01_preprocessed.h5ad"
adata_tcr = sc.read(path_tcr)
Example 3 (python):
```python
import scanpy as sc

sc.settings.verbosity = 0
sc.settings.set_figure_params(dpi=80, facecolor="white", frameon=False)
```
Example 4 (python):
```python
adata = pt.dt.haber_2017_regions()
```
Example 5 (python):
```python
from __future__ import annotations

import anndata as ad
import decoupler
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn.objects as so
import session_info
```
Reference Files
This skill includes comprehensive documentation in references/:
- acknowledgements.md - Acknowledgements documentation
- air_repertoire_clonotype.md - Air Repertoire Clonotype documentation
- air_repertoire_ir_profiling.md - Air Repertoire Ir Profiling documentation
- air_repertoire_multimodal_integration.md - Air Repertoire Multimodal Integration documentation
- air_repertoire_specificity.md - Air Repertoire Specificity documentation
- cellular_structure_annotation.md - Cellular Structure Annotation documentation
- cellular_structure_clustering.md - Cellular Structure Clustering documentation
- cellular_structure_integration.md - Cellular Structure Integration documentation
- chromatin_accessibility_gene_regulatory_networks_atac.md - Chromatin Accessibility Gene Regulatory Networks Atac documentation
- chromatin_accessibility_introduction.md - Chromatin Accessibility Introduction documentation
- chromatin_accessibility_quality_control.md - Chromatin Accessibility Quality Control documentation
- conditions_compositional.md - Conditions Compositional documentation
- conditions_differential_gene_expression.md - Conditions Differential Gene Expression documentation
- conditions_gsea_pathway.md - Conditions Gsea Pathway documentation
- conditions_perturbation_modeling.md - Conditions Perturbation Modeling documentation
- deconvolution_bulk_deconvolution.md - Deconvolution Bulk Deconvolution documentation
- glossary.md - Glossary documentation
- index.md - Index documentation
- introduction_analysis_tools.md - Introduction Analysis Tools documentation
- introduction_interoperability.md - Introduction Interoperability documentation
- introduction_prior_art.md - Introduction Prior Art documentation
- introduction_raw_data_processing.md - Introduction Raw Data Processing documentation
- introduction_scrna_seq.md - Introduction Scrna Seq documentation
- mechanisms_cell_cell_communication.md - Mechanisms Cell Cell Communication documentation
- mechanisms_gene_regulatory_networks.md - Mechanisms Gene Regulatory Networks documentation
- multimodal_integration_advanced_integration.md - Multimodal Integration Advanced Integration documentation
- multimodal_integration_paired_integration.md - Multimodal Integration Paired Integration documentation
- outlook.md - Outlook documentation
- preamble.md - Preamble documentation
- preprocessing_visualization_dimensionality_reduction.md - Preprocessing Visualization Dimensionality Reduction documentation
- preprocessing_visualization_feature_selection.md - Preprocessing Visualization Feature Selection documentation
- preprocessing_visualization_normalization.md - Preprocessing Visualization Normalization documentation
- preprocessing_visualization_quality_control.md - Preprocessing Visualization Quality Control documentation
- reproducibility_introduction.md - Reproducibility Introduction documentation
- spatial_deconvolution.md - Spatial Deconvolution documentation
- spatial_domains.md - Spatial Domains documentation
- spatial_imputation.md - Spatial Imputation documentation
- spatial_introduction.md - Spatial Introduction documentation
- spatial_neighborhood.md - Spatial Neighborhood documentation
- spatial_spatially_variable_genes.md - Spatial Spatially Variable Genes documentation
- surface_protein_annotation.md - Surface Protein Annotation documentation
- surface_protein_batch_correction.md - Surface Protein Batch Correction documentation
- surface_protein_dimensionality_reduction.md - Surface Protein Dimensionality Reduction documentation
- surface_protein_doublet_detection.md - Surface Protein Doublet Detection documentation
- surface_protein_normalization.md - Surface Protein Normalization documentation
- surface_protein_quality_control.md - Surface Protein Quality Control documentation
- trajectories_lineage_tracing.md - Trajectories Lineage Tracing documentation
- trajectories_pseudotemporal.md - Trajectories Pseudotemporal documentation
- trajectories_rna_velocity.md - Trajectories Rna Velocity documentation
Use view to read specific reference files when detailed information is needed.
Working with This Skill
For Beginners
Start with preamble.md and the introduction reference files (e.g. introduction_scrna_seq.md, introduction_analysis_tools.md) for foundational concepts.
For Specific Features
Use the appropriate category reference file (e.g. preprocessing_visualization_*, spatial_*, surface_protein_*) for detailed information.
For Code Examples
The quick reference section above contains common patterns extracted from the official docs.
Resources
references/
Organized documentation extracted from official sources. These files contain:
- Detailed explanations
- Code examples with language annotations
- Links to original documentation
- Table of contents for quick navigation
scripts/
Add helper scripts here for common automation tasks.
assets/
Add templates, boilerplate, or example projects here.
Notes
- This skill was automatically generated from official documentation
- Reference files preserve the structure and examples from source docs
- Code examples include language detection for better syntax highlighting
- Quick reference patterns are extracted from common usage examples in the docs
Updating
To refresh this skill with updated documentation:
- Re-run the scraper with the same configuration
- The skill will be rebuilt with the latest information