Human genome
The human genome comprises the full complement of DNA sequences found in human cells, organised primarily within the 23 nuclear chromosome pairs and a much smaller circular genome contained in mitochondria. Together, these components encode protein-coding genes, noncoding RNA genes, regulatory elements and a large array of repetitive and mobile DNA. Advances in high-throughput sequencing have enabled near-complete determination of the human genome, although significant work remains to interpret the biological roles of many of its components. Contemporary genomic research extends beyond a single reference sequence to embrace genetic diversity across populations, culminating in the recent development of pangenome references.
Structure and Size of the Genome
The nuclear genome is distributed across 22 autosomes and two sex chromosomes (X and Y), whereas mitochondrial DNA forms a separate, maternally inherited molecule. The current human reference genome (GRCh38.p14, released in 2023) contains approximately 3.1 billion base pairs, representing a haploid chromosome set based on combined data from several individuals. Most somatic human cells are diploid, therefore harbouring around 6.2 billion base pairs.
Sequencing milestones include the first draft assembly in 2000, which resolved approximately 88 per cent of the sequence, followed by a fully sequenced female genome in 2021 and the complete sequencing of the Y chromosome in 2022. In 2023, a draft human pangenome reference encompassing 47 genetically diverse individuals was published, with future plans aimed at broadening global representation. Genetic variation among humans is modest: roughly 0.1 per cent for single-nucleotide variants and 0.6 per cent when insertions and deletions are included. Both figures are substantially smaller than the corresponding differences between humans and chimpanzees.
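To give the percentages above a sense of scale, the following is a rough back-of-envelope sketch rather than a precise calculation. It assumes the quoted percentages can be applied directly to the approximate 3.1 billion base-pair haploid reference length; the constant and function names are illustrative only.

```python
# Back-of-envelope scale estimate using the figures quoted in the text.
# Assumption: the percentages are treated as simple fractions of the
# haploid reference length, which is a simplification.

HAPLOID_BP = 3.1e9            # approximate haploid reference size
SNV_FRACTION = 0.001          # ~0.1 per cent single-nucleotide variation
INDEL_INCL_FRACTION = 0.006   # ~0.6 per cent when indels are included


def differing_bases(genome_bp: float, fraction: float) -> int:
    """Return the approximate number of base positions that differ."""
    return round(genome_bp * fraction)


diploid_bp = 2 * HAPLOID_BP
snv_sites = differing_bases(HAPLOID_BP, SNV_FRACTION)
all_sites = differing_bases(HAPLOID_BP, INDEL_INCL_FRACTION)

print(f"Diploid genome size: {diploid_bp:.1e} bp")
print(f"Positions differing by SNVs: ~{snv_sites:,}")
print(f"Positions differing including indels: ~{all_sites:,}")
```

Under these assumptions, a 0.1 per cent difference corresponds to about 3.1 million positions per haploid genome, and 0.6 per cent to about 18.6 million.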
Protein-Coding Genes and Gene Organisation
Protein-coding sequences constitute the most extensively studied portion of the genome. The human reference genome carries an estimated 19,000–20,000 protein-coding genes, each containing around ten introns on average, with an average intron length of approximately 6 kilobases. This structure yields an average gene size of about 62 kilobases. Exons, which comprise protein-coding sequence and untranslated regions, occupy only about 1.2 per cent of the genome. Once introns are included, however, protein-coding genes collectively account for around 40 per cent of the total genomic sequence.
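The figures in this paragraph can be checked against one another with simple arithmetic. The sketch below uses only the averages quoted above plus the approximate 3.1 billion base-pair haploid genome size. The 2-kilobase exonic remainder is an inference from those averages (62 kb minus 60 kb of intron), not a separately sourced value, and all names are illustrative.

```python
# Consistency check of the gene-structure averages quoted in the text.

INTRONS_PER_GENE = 10          # average introns per protein-coding gene
MEAN_INTRON_BP = 6_000         # average intron length
MEAN_GENE_BP = 62_000          # average protein-coding gene length
HAPLOID_BP = 3_100_000_000     # approximate haploid genome size

# Intronic sequence per average gene, and the remainder left for exons.
intronic_bp = INTRONS_PER_GENE * MEAN_INTRON_BP
exonic_bp = MEAN_GENE_BP - intronic_bp   # inferred, not quoted in the text


def genic_fraction(gene_count: int) -> float:
    """Fraction of the genome covered by protein-coding genes."""
    return gene_count * MEAN_GENE_BP / HAPLOID_BP


def exonic_fraction(gene_count: int) -> float:
    """Fraction of the genome covered by exons of protein-coding genes."""
    return gene_count * exonic_bp / HAPLOID_BP


for count in (19_000, 20_000):
    print(
        f"{count:,} genes: genes span {genic_fraction(count):.1%}, "
        f"exons span {exonic_fraction(count):.2%} of the genome"
    )
```

With 19,000 to 20,000 genes, this gives a genic share of 38 to 40 per cent and an exonic share of roughly 1.2 to 1.3 per cent, in line with the figures quoted above.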
Protein diversity arises not simply from gene number but through mechanisms such as alternative splicing and V(D)J recombination. These processes substantially expand the repertoire of distinct proteins that can be produced from the relatively modest number of protein-coding genes.
Noncoding Genes and RNA Components
Noncoding RNA genes fulfil critical roles in gene regulation, protein biosynthesis and numerous cellular processes. These include genes encoding:
- transfer RNAs
- ribosomal RNAs
- microRNAs
- small nuclear RNAs
- long noncoding RNAs
Many of these RNAs participate in epigenetic regulation, transcriptional control, RNA splicing and translational machinery. Although thousands of noncoding genes have been identified, the total number and functional significance of many noncoding RNAs remain a subject of ongoing research. Some noncoding transcripts may be nonfunctional, whereas others are key regulators of gene expression and cellular behaviour.
Pseudogenes and Gene Duplication
Pseudogenes are inactivated copies of protein-coding genes produced largely through gene duplication followed by mutational degradation. The human genome contains roughly 13,000 pseudogenes, with some chromosomes showing almost equal numbers of pseudogenes and functional genes. Gene duplication is an important mechanism of evolutionary innovation, supplying genetic material that may acquire new functions.
The olfactory receptor gene family provides a striking example: over 60 per cent of human olfactory receptor genes are pseudogenes, compared with about 20 per cent in mice. This difference correlates with humans’ comparatively weaker sense of smell and highlights species-specific evolutionary paths.
Regulatory DNA and Gene Expression Control
Regulatory sequences are noncoding regions that control when and where genes are expressed. These include promoters, enhancers, silencers, and other cis-acting elements. Estimates suggest that regulatory DNA makes up at least 8 per cent of the genome, whereas analyses from the ENCODE project point to figures of 20 per cent or more.
Regulatory sequences were originally identified using recombinant DNA techniques and later through comparative genomics, which uses evolutionary conservation to infer function. Because many regulatory elements evolve rapidly, conservation-based approaches have limitations; current methodologies favour experimental approaches such as chromatin immunoprecipitation sequencing (ChIP-seq) and the mapping of DNase I hypersensitive sites to locate active regulatory regions in specific cell types.
Repetitive DNA and Genome Architecture
Approximately 50 per cent of the human genome consists of repetitive sequences. These include:
- Tandem repeats, making up about 8 per cent of total DNA, with highly variable lengths used in applications such as forensic genetics.
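- Microsatellites, a class of tandem repeat in which the repeated unit is typically fewer than ten nucleotides long.
- Trinucleotide repeats, microsatellites with a three-nucleotide unit, which can occur in coding regions and whose expansion is implicated in genetic disorders such as Huntington’s disease.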
- Mobile genetic elements, including transposons and retroelements, which have shaped genomic evolution and structure.
Such sequences contribute both to evolutionary flexibility and genomic instability, depending on their location and activity.
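The definition of a microsatellite given in the list above (a tandemly repeated unit of fewer than ten nucleotides) can be illustrated with a short pattern search. This is a minimal sketch, not a substitute for dedicated repeat-annotation tools. The function name, the four-copy minimum and the example sequence are assumptions made for illustration.

```python
import re


def find_microsatellites(seq, max_unit=9, min_copies=4):
    """Return (start, unit, copies) for tandem runs of short repeat units.

    max_unit follows the "fewer than ten nucleotides" definition in the
    text; min_copies is an arbitrary illustrative threshold.
    """
    # Lazy unit match so the shortest repeating unit is preferred,
    # followed by at least (min_copies - 1) further copies of that unit.
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


# Hypothetical sequence: an AC dinucleotide run and a CAG trinucleotide run.
example = "TTG" + "AC" * 5 + "TT" + "CAG" * 5 + "TT"
hits = find_microsatellites(example)
for start, unit, copies in hits:
    print(f"offset {start}: ({unit})x{copies}")
```

On the example sequence this reports an AC run of five copies at offset 3 and a CAG run of five copies at offset 15. Because the scan runs left to right, a repeat may be reported in whichever phase is encountered first (for example GCA rather than CAG when the run is preceded by a G).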
Functional Interpretation and Future Directions
Despite the complete nucleotide sequence being known, the human genome remains only partly understood. Significant challenges include determining the full functionality of noncoding DNA, deciphering the interplay between genes and regulatory elements, and mapping how epigenetic modifications influence gene expression.
Advances in pangenomics, epigenomic sequencing and single-cell analysis promise to refine understanding of human genetic diversity and the molecular basis of disease. The human genome thus represents not only a blueprint for human biology but also a dynamic research frontier at the intersection of genetics, evolution and medicine.