CLA: finding lineage-specific genes from transcriptome datasets and the cell ontology.

Mendez M, Harshbarger J, Hoffman MM. Automated identification of cell-type–specific genes and alternative promoters. BioRxiv 2021.12.01.470587 [Preprint]. 2021. Available from:

The cell lineage analysis (CLA) identifies cell-type–specific genes from large collections of transcriptome datasets. It works on both bulk transcriptome datasets, such as FANTOM5 CAGE, and single-cell RNA-seq datasets.


CLA features

The CLA method combines several tools to quickly identify key marker genes in datasets containing many cell types. It uses a random forest–based algorithm to find the most differentially expressed genes between two cell types. To find cell-type–specific genes, a user only needs to provide a gene expression table and a YAML file giving the cell type of origin of each sample. CLA then runs the random forest on every pair of cell types and computes a lineage score for each gene in each cell type. The lineage score indicates how often a gene is upregulated or downregulated in a cell type relative to the others. Typically, the genes with the highest scores are the most cell-type–specific.
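The pairwise-comparison and scoring idea can be illustrated with a minimal sketch. This is not the CLA implementation: the random forest step is replaced by a simple mean-difference threshold so the example stays dependency-free, and the score formula (comparisons where the gene is up, minus comparisons where it is down, divided by the number of other cell types) is an assumption made for illustration. The function names, the `min_diff` parameter, and the toy expression values are all hypothetical.

```python
from itertools import combinations
from statistics import mean


def pairwise_direction(expr, samples_a, samples_b, min_diff=1.0):
    """Stand-in for the random forest step (NOT the CLA gene selector).

    Returns {gene: +1} when mean expression is higher in group A than in
    group B by at least min_diff, {gene: -1} when it is lower by at least
    min_diff. Genes below the threshold are omitted.
    """
    directions = {}
    for gene, values in expr.items():
        diff = mean(values[s] for s in samples_a) - mean(values[s] for s in samples_b)
        if abs(diff) >= min_diff:
            directions[gene] = 1 if diff > 0 else -1
    return directions


def lineage_scores(expr, cell_types, min_diff=1.0):
    """Assumed score: (#pairs where up - #pairs where down) / (#cell types - 1)."""
    counts = {gene: {ct: 0 for ct in cell_types} for gene in expr}
    for ct_a, ct_b in combinations(cell_types, 2):
        selected = pairwise_direction(expr, cell_types[ct_a], cell_types[ct_b], min_diff)
        for gene, sign in selected.items():
            counts[gene][ct_a] += sign  # up in A counts for A ...
            counts[gene][ct_b] -= sign  # ... and against B
    n_other = len(cell_types) - 1
    return {
        gene: {ct: count / n_other for ct, count in per_ct.items()}
        for gene, per_ct in counts.items()
    }


# Toy expression table: gene -> sample -> expression (made-up values).
expr = {
    "SPI1": {"mono1": 9.0, "mono2": 8.5, "tcell1": 0.5, "tcell2": 0.7, "hep1": 0.2, "hep2": 0.1},
    "ALB": {"mono1": 0.1, "mono2": 0.2, "tcell1": 0.1, "tcell2": 0.3, "hep1": 12.0, "hep2": 11.0},
    "ACTB": {"mono1": 10.0, "mono2": 10.2, "tcell1": 9.9, "tcell2": 10.1, "hep1": 10.0, "hep2": 10.3},
}

# Cell type of origin of each sample, mirroring the kind of mapping a YAML
# file could carry (the layout here is hypothetical).
cell_types = {
    "monocyte": ["mono1", "mono2"],
    "T cell": ["tcell1", "tcell2"],
    "hepatocyte": ["hep1", "hep2"],
}

scores = lineage_scores(expr, cell_types)
for gene, per_ct in scores.items():
    print(gene, per_ct)
```

In this toy example, a gene that is higher in one cell type than in every other cell type gets a score of 1.0 for that cell type, while a uniformly expressed gene scores 0 everywhere.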


We used CLA to identify markers for 71 cell types from the FANTOM5 CAGE datasets. We also used the FANTOM5 Cell Ontology to visualize CLA lineage scores in a lineage context. This made it possible to efficiently identify the lineage specificity of transcription factors with known specificity (such as SPI1), of known lncRNAs with unknown function, and of alternative promoters of cell surface markers. We also applied CLA to single-cell datasets and found that it can identify the most differentially expressed genes between two clusters while excluding zero-inflated genes.

CLA method


CLA is divided into two tools, both available on GitHub: skrrf contains the regularized random forest algorithm, and claweb contains all the tools to run the pairwise comparisons, generate the figures, and summarize the results in a website for exploratory analysis.

Paper data

Data and results:
Website with results from TF, lncRNA, and CD genes analysis:


CLA was developed by Mickaël Mendez and Michael Hoffman.