Mendez M, Harshbarger J, Hoffman MM. Automated identification of cell-type–specific genes and alternative promoters. BioRxiv 2021.12.01.470587 [Preprint]. 2021. Available from: https://doi.org/10.1101/2021.12.01.470587
Cell lineage analysis (CLA) identifies cell-type–specific genes from large collections of transcriptome datasets. It works on bulk transcriptome datasets, such as FANTOM5 CAGE, as well as on single-cell RNA-seq datasets.
The CLA method uses multiple tools to quickly identify key marker genes from datasets containing many cell types. It uses an algorithm based on random forest to find the most differentially expressed genes between two cell types. To find cell-type–specific genes, a user only needs to provide a gene expression table and, in YAML format, the cell type of origin of each sample. CLA then runs random forest on each pair of cell types and computes a lineage score for each gene and each cell type. The lineage score indicates how often a gene is upregulated or downregulated in each cell type. Typically, the genes with the highest scores are the most cell-type–specific.
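The pairwise scoring idea described above can be illustrated with a minimal sketch. This is not CLA's implementation: the function names, the toy expression values, the normalization by the number of comparisons, and the use of a simple difference in mean expression in place of the random forest step are all assumptions made to keep the example self-contained.

```python
from itertools import combinations
from statistics import mean


def top_genes_by_mean_difference(expr_a, expr_b, n_top):
    """Hypothetical stand-in for the random forest step (NOT CLA's method).

    expr_a and expr_b map gene -> list of expression values for the samples
    of one cell type. Returns (gene, direction) pairs for the n_top genes with
    the largest absolute difference in mean expression; direction is +1 when
    the gene is higher in A and -1 when it is higher in B.
    """
    diffs = {g: mean(expr_a[g]) - mean(expr_b[g]) for g in expr_a}
    nonzero = [g for g in diffs if diffs[g] != 0]
    ranked = sorted(nonzero, key=lambda g: abs(diffs[g]), reverse=True)[:n_top]
    return [(g, 1 if diffs[g] > 0 else -1) for g in ranked]


def lineage_scores(expression, n_top=2):
    """Aggregate pairwise results into one score per gene and cell type.

    expression maps cell type -> gene -> list of expression values.
    Assumed scoring: +1 each time a gene is selected as upregulated in a cell
    type, -1 each time it is selected as downregulated, divided by the number
    of comparisons the cell type takes part in.
    """
    cell_types = sorted(expression)
    genes = sorted(expression[cell_types[0]])
    scores = {ct: {g: 0.0 for g in genes} for ct in cell_types}

    # Compare every pair of cell types once.
    for a, b in combinations(cell_types, 2):
        selected = top_genes_by_mean_difference(expression[a], expression[b], n_top)
        for gene, direction in selected:
            scores[a][gene] += direction
            scores[b][gene] -= direction

    # Each cell type is compared against all the others.
    n_comparisons = len(cell_types) - 1
    for ct in cell_types:
        for g in genes:
            scores[ct][g] /= n_comparisons
    return scores


# Toy, made-up expression values: each gene is high in exactly one cell type.
expression = {
    "type_x": {"GENE_A": [9, 10], "GENE_B": [1, 1], "GENE_C": [1, 2]},
    "type_y": {"GENE_A": [1, 1], "GENE_B": [8, 9], "GENE_C": [1, 1]},
    "type_z": {"GENE_A": [1, 2], "GENE_B": [1, 1], "GENE_C": [10, 11]},
}

scores = lineage_scores(expression, n_top=2)

# Rank genes within one cell type, highest score first.
ranked_type_x = sorted(scores["type_x"], key=scores["type_x"].get, reverse=True)
```

In this toy setup, a gene that is selected as upregulated in every comparison involving a cell type reaches the maximum score for that cell type, which mirrors the statement that the highest-scoring genes tend to be the most cell-type–specific.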
We used CLA to identify markers for 71 cell types from the FANTOM5 CAGE datasets. We also used the FANTOM5 Cell Ontology to visualize CLA lineage scores in a lineage context. This allowed us to efficiently identify the lineage specificity of known specific transcription factors (SPI1), known lncRNAs with unknown function, and alternative promoters of cell surface markers. We also applied CLA to single-cell datasets and found that it can identify the most differentially expressed genes between two clusters while excluding zero-inflated genes.
CLA was developed by Mickaël Mendez and Michael Hoffman.