Changes to the 2. chapter: "Single-cell RNA sequencing"! (#329)

LuisHeinzlmeier · Luis · Zethson · web-flow · commit fb70d5d06119 · 2025-02-20T14:42:04.000+01:00
* Improve wording, insert missing spaces and symbols (, and .) and add a note field until third generation sequencing (inclusive).

* proofreading of two paragraphs

* Proofreading until the end of the chapter

* again proofreading until RNA sequencing (grammar)

* change v3 of the artifact actions to v4

* adding key takeaways and some missing terms

* improvements based on Lukas comments + added cards with internal links for the key takeaways

* change one sentence

* Update jupyter-book/introduction/scrna_seq.md

Suggestion from Lukas 1

Co-authored-by: Lukas Heumos <lukas.heumos@posteo.net>

* Update jupyter-book/introduction/scrna_seq.md

Suggestion from Lukas 2

Co-authored-by: Lukas Heumos <lukas.heumos@posteo.net>

* put key takeaways in a dropdown and made them shorter

* update terms and put further readings into a {seealso} dropdown box

* adding many terms of the glossary

* change to new anchor logic

---------

Co-authored-by: Luis <ge34lah@mytum.de>
Co-authored-by: Lukas Heumos <lukas.heumos@posteo.net>
diff --git a/jupyter-book/air_repertoire/clonotype.ipynb b/jupyter-book/air_repertoire/clonotype.ipynb
@@ -2033,7 +2033,7 @@
    "source": [
     "Dandelion defines clonotypes using a substitution model based on distances. It was created specifically to deal with the problem of somatic hypermutation in B-cells {cite}`yaari2013models` {cite}`cui2016model`. This model was available in the **Immcantation** suite as an R package {cite}`gupta2015change` {cite}`vander2014presto`. However, Dandelion makes possible to use it, avoiding the complication of moving between code languages and keeping the interoperability with *Scanpy* and *Scirpy*.\n",
     "\n",
-    "The model was created based on the probability of a punctual nucleotide change, considering the influence of the immediate two down- and upstream nucleotides {cite}`yaari2013models`. This methodology considered all the possible different 5-mers combinations just for the synonym mutation cases, i.e., those changes where the amino acid represented by the codon is not modified {cite}`yaari2013models`.\n",
+    "The model was created based on the probability of a punctual nucleotide change, considering the influence of the immediate two down- and upstream nucleotides {cite}`yaari2013models`. This methodology considered all the possible different 5-mers combinations just for the synonym mutation cases, i.e., those changes where the amino acid represented by the {term}`codon` is not modified {cite}`yaari2013models`.\n",
     "\n",
     "Furthermore, Dandelion considers a model of substitution rates for single nucleotide instead of the 5-mer model. Therefore, all the substitutions are not changing, and they are displayed in the table below:\n",
     "\n",
diff --git a/jupyter-book/air_repertoire/ir_profiling.ipynb b/jupyter-book/air_repertoire/ir_profiling.ipynb
@@ -564,7 +564,7 @@
     "Even though we can detect the AIR sequence, it might not be productive, i.e., it might not form a valid AIR. Sequences, which do not result in functional AIRs, are therefore flagged as non-productive. These are usually ignored, when loading data by tools such as Scirpy, and not used for any downstream analysis.\n",
     "Productive Immune receptors are defined by 10x Genomics [here](https://kb.10xgenomics.com/hc/en-us/articles/115003248383-What-are-productive-contigs-) as:\n",
     "- Sequences spanning over from a V gene to a J-gene\n",
-    "- Having a start codon in the leading region\n",
+    "- Having a start {term}`codon` in the leading region\n",
     "- Containing a CDR3 in the frame of the start codon.\n",
     "- Do not contain a stop codon within the V-J span"
    ]
diff --git a/jupyter-book/chromatin_accessibility/introduction.ipynb b/jupyter-book/chromatin_accessibility/introduction.ipynb
@@ -66,7 +66,7 @@
     "tags": []
    },
    "source": [
-    "As depicted above, chromatin accessibility is influenced by higher-order structure down to low-level DNA modifications. **(1)** Chromatin scaffolding driven by scaffold/matrix attachment regions (S/MARs) and proteins in the nuclear periphery such as nuclear pore complexes (NPCs) or lamins influences chromatin compactness and gene expression {cite}`atac:narwade_mapping_2019, atac:buchwalter_coaching_2019`. **(2, 3)** More local accessibility often referred to as densly packed heterochromatin versus open euchromatin can be actively controlled by ATP-dependent and ATP-independent chromatin remodeling complexes and histone modifications such as acetylation, methylation and phosphorylation. **(4)** Also the binding of transcription factors can influence nucleosome positioning and lead to the recruitment of histone-modifying enzymes and chromatin remodelers. **(5)** On a DNA level, methylation of CpG sites influences the binding affinity of various proteins including transcription factors and histone-modifying enzymes which combined leads to the silencing of the corresponding genomic regions. For an animated visualization we also recommend [this 2 minute video](https://www.youtube.com/watch?v=XelGO582s4U) on epigenetics and the regulation of gene activity (credits to Nicole Ethen from the SQE, University of Illinois). For a comprehensive and up-to-date review on genome regulation and TF activity, we refer to {cite}`atac:isbel_generating_2022`.\n",
+    "As depicted above, chromatin accessibility is influenced by higher-order structure down to low-level DNA modifications. **(1)** Chromatin scaffolding driven by scaffold/matrix attachment regions (S/MARs) and proteins in the nuclear periphery such as nuclear pore complexes (NPCs) or lamins influences chromatin compactness and gene expression {cite}`atac:narwade_mapping_2019, atac:buchwalter_coaching_2019`. **(2, 3)** More local accessibility often referred to as densly packed heterochromatin versus open euchromatin can be actively controlled by ATP-dependent and ATP-independent chromatin remodeling complexes and histone modifications such as acetylation, methylation and phosphorylation. **(4)** Also the binding of transcription factors can influence nucleosome positioning and lead to the recruitment of histone-modifying enzymes and chromatin remodelers. **(5)** On a DNA level, methylation of {term}`CpG` sites influences the binding affinity of various proteins including transcription factors and histone-modifying enzymes which combined leads to the silencing of the corresponding genomic regions. For an animated visualization we also recommend [this 2 minute video](https://www.youtube.com/watch?v=XelGO582s4U) on epigenetics and the regulation of gene activity (credits to Nicole Ethen from the SQE, University of Illinois). For a comprehensive and up-to-date review on genome regulation and TF activity, we refer to {cite}`atac:isbel_generating_2022`.\n",
     "\n",
     "Taken together, an essential component defining cell identity is the regulatory state of each cell. In this chapter, we focus on chromatin accessibility data measured by the **Single-Cell Assay for Transposase-Accessible Chromatin with High-Throughput Sequencing (scATAC-seq)** or as part of the **10x Multiome assay (scATAC combined with scRNA-seq)**. \n",
     "\n",
diff --git a/jupyter-book/glossary.md b/jupyter-book/glossary.md
@@ -22,10 +22,14 @@ BAM files
     BAM files are binary, compressed versions of SAM (Sequence Alignment/Map) files that store sequencing read alignments to a reference genome.
     They contain the same information as {term}`SAM` files - including read sequences, quality scores, and alignment positions - but in a more space-efficient format that enables faster processing and reduced storage requirements.
 
+Amplification bias
+    A distortion that occurs during DNA or RNA amplification (e.g., PCR), where certain sequences are copied more efficiently than others. This can lead to uneven or inaccurate representation of the original genetic material, affecting results in experiments like sequencing or gene expression analysis.
+
 Barcode
 Barcodes
 Bar code
-Bar codes
+Bar code
+Cell barcode
     Short DNA barcode fragments ("tags") that are used to identify reads originating from the same cell.
     Reads are later grouped by their barcode during raw data processing steps.
 
@@ -37,15 +41,15 @@ Benchmark
     An (independent) comparison of performance of several tools with respect to pre-defined metrics.
 
 Bulk RNA sequencing
-    Contrary to single-cell sequencing, bulk sequencing measures the average expression values of several cells.
-    Therefore, resolution is lost, but bulk sequencing is usually cheaper, less laborious and faster to analyze.
+bulk RNA-Seq
+bulk sequencing
+    Contrary to single-cell sequencing, bulk sequencing measures the average expression values of several cells. Therefore, resolution is lost, but bulk sequencing is usually cheaper, less laborious and faster to analyze.
 
 Cell
+cells
     The fundamental unit of life, consisting of cytoplasm enclosed within a membrane, containing biomolecules such as proteins and nucleic acids.
     Cells acquire specific functions, transition into different types, divide, and communicate to sustain an organism.
     Studying cell structure, activity, and interactions enables insights into gene expression dynamics, cellular trajectories, developmental lineages, and disease mechanisms.
-Cell barcode
-    See {term}`barcode`
 
 Cell type annotation
     The process of labeling groups of {term}`clusters` of cells by {term}`cell type`.
@@ -60,6 +64,15 @@ Cell state
 Chromatin
     The complex of DNA and proteins efficiently packaging the DNA inside the nucleus and involved in regulating gene expression.
 
+Codon
+    A sequence of three nucleotides corresponding to a specific amino acid or a start/stop signal in protein synthesis.
+    Codons are the basic units of the genetic code, determining how genetic information is translated into proteins.
+
+CpG
+    A DNA sequence in which a cytosine (C) is followed by a guanine (G) along the 5' &rarr; 3' direction, linked by a phosphodiester bond.
+    CpG sites are often found in clusters called CpG islands near gene promoters.
+    Unmethylated CpG sites are associated with gene activation, while methylated CpG sites can lead to gene inhibition.
+
 Cluster
 Clusters
     A group of a population or data points that share similarities.
@@ -129,6 +142,12 @@ Indrop
 Library
     Also known as sequencing library. A pool of DNA fragments with attached sequencing adapters.
 
+Modalities
+Multimodal
+    Different types of biological information measured at the single-cell level.
+    These include gene expression, chromatin accessibility, surface proteins, immune receptor sequences, and spatial organization.
+    Combining these modalities provides a more complete understanding of cell identity, function, and interactions.
+
 Locus
 Loci
 loci
@@ -209,9 +228,9 @@ Trajectory inference
     The computational recovery of dynamic processes by ordering cells by similarity or other means.
 
 Unique Molecular Identifier (UMI)
-unique molecular identifiers (UMIs)
-    Specific type of molecular barcodes aiding with error correction and increased accuracy during sequencing.
-    UMIs unique tag molecules in sample libraries enabling estimation of PCR duplication rates.
+UMI
+    A special type of molecular barcode that uniquely tags each molecule in a sample library.
+    This, for example, enables the estimation of PCR duplication rates (see {term}`amplification bias`), which leads to error correction and increases accuracy.
 
 Untranslated Region (UTR)
 UTR
diff --git a/jupyter-book/introduction/raw_data_processing.md b/jupyter-book/introduction/raw_data_processing.md
@@ -496,8 +496,7 @@ Several common strategies are used for cell barcode identification and correctio
 After cell barcode (CB) correction, reads have either been discarded or assigned to a corrected CB.
 Subsequently, we wish to quantify the abundance of each gene within each corrected CB.
 
-Because of the amplification bias as discussed in {ref}`exp-data:transcript-quantification`, reads must be deduplicated, based upon their UMI, to assess the true count of sampled molecules.
-Additionally, several other complicating factors present challenges when attempting to perform this estimation.
+Because of the {term}`amplification bias` as discussed in {ref}`exp-data:transcript-quantification`, reads must be deduplicated, based upon their UMI, to assess the true count of sampled molecules. Additionally, several other complicating factors present challenges when attempting to perform this estimation.
 
 The UMI deduplication step aims to identify the set of reads and UMIs derived from each original, pre-PCR molecule in each cell captured and sequenced in the experiment.
 The result of this process is to allocate a molecule count to each gene in each cell, which is subsequently used in the downstream analysis as the raw expression estimate for this gene.
diff --git a/jupyter-book/introduction/scrna_seq.bib b/jupyter-book/introduction/scrna_seq.bib
@@ -38,11 +38,11 @@ @Article{Svensson2017
 url={https://doi.org/10.1038/nmeth.4220}
 }
 
-﻿@Article{JOU1972,
-author={JOU, W. MIN
-and HAEGEMAN, G.
-and YSEBAERT, M.
-and FIERS, W.},
+﻿@Article{Jou1972,
+author={Jou, W. Min
+and Haegeman, G.
+and Ysebaert, M.
+and Fiers, W.},
 title={Nucleotide Sequence of the Gene Coding for the Bacteriophage MS2 Coat Protein},
 journal={Nature},
 year={1972},
diff --git a/jupyter-book/introduction/scrna_seq.md b/jupyter-book/introduction/scrna_seq.md
