Number of Protein-Coding Genes in the Human Genome
The human genome contains approximately 20,000-25,000 protein-coding genes, with current estimates placing the number at fewer than 20,000 genes. 1, 2
Current Gene Count Estimates
The most recent and authoritative sources provide converging estimates:
- GENCODE release 27 (2018) reports 19,836 protein-coding genes in the human genome, representing the most conservative and well-validated count 3
- The 2004 completion of the euchromatic human genome sequence estimated 20,000-25,000 protein-coding genes 2
- A 2023 review confirms that protein-coding genes are "currently estimated to number fewer than 20,000" 1
Database Variations and Why Numbers Differ
Different gene annotation databases report varying numbers due to their distinct inclusion criteria and validation standards:
- Ensembl (2017): 26,998 protein-coding genes with 81,787 mRNA transcripts, using an inclusive approach 3
- GenBank RefSeq (2017): 21,104 protein-coding genes with 34,799 mRNA transcripts, using more conservative criteria requiring peer-reviewed evidence 3
- These discrepancies arise because gene models are often incomplete and change over time, with the precise structure of many genes still under debate 3
Historical Context and Refinement
The gene count has been progressively refined downward from initial estimates:
- Early estimates ranged from 50,000 to over 140,000 genes 4
- A 2007 analysis reduced the catalog to approximately 20,500 protein-coding genes by excluding non-conserved ORFs that likely represent random occurrences rather than functional genes 5
- The 2000 Exofish analysis estimated 28,000-34,000 genes 4
The reduction in gene count reflects improved methodology for distinguishing true protein-coding genes from functionally meaningless open reading frames (ORFs) present by chance in RNA transcripts 5
Non-Coding RNA Genes
Beyond protein-coding genes, the genome contains a substantial number of non-coding RNA genes:
- GENCODE release 27 reports 23,347 non-protein-coding RNA genes, with 15,778 classified as long non-coding RNAs (lncRNAs) 3
- The total number of non-coding RNA genes now exceeds the number of protein-coding genes 3
- However, for most non-coding RNAs, functional relevance remains unclear 1
Clinical Implications
For whole exome sequencing (WES) applications:
- The human exome comprises approximately 180,000 exons representing about 1% of the genome (approximately 30 million nucleotides) 6
- Each exome contains about 13,500 single nucleotide variants affecting amino acid sequences 6
- More than three-quarters of known disease-causing variants are located in the exome 6
Key Caveats
The human gene catalog remains incomplete despite decades of effort 1:
- Gene models are incomplete and continue to evolve 3
- The number of distinct protein-coding isoforms continues to expand through alternative splicing 1
- No universal annotation standard exists that includes all medically significant genes 1
- Different databases may place the same variant in a coding exon versus an intron depending on their gene models 3