Essential Gene Classification from DNA Sequences

This project implements a baseline machine learning pipeline to classify bacterial genes as essential or non-essential using DNA sequence information from the macwiatrak/bacbench-essential-genes-dna dataset (Hugging Face Datasets).

Project Overview

The notebook:

  • loads the macwiatrak/bacbench-essential-genes-dna dataset and keeps only the dna_seq and essential columns,
  • encodes the labels and maps each DNA sequence to integer codes,
  • extracts non-overlapping 4-mer count features,
  • trains a Logistic Regression classifier on the train split, and
  • evaluates accuracy and F1 score on the validation and test splits.

This serves as a simple, fast baseline for essential-gene prediction from raw DNA sequences.

Dataset

The project uses the macwiatrak/bacbench-essential-genes-dna dataset loaded via datasets.load_dataset.
Each split (train, validation, test) originally contains the dna_seq (raw DNA sequence) and essential (essentiality label) fields, along with several metadata columns.

In this notebook, the unnecessary metadata columns are dropped, and only dna_seq and essential are retained for modeling.
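
A minimal loading sketch, assuming the dataset's default configuration and that the metadata columns are dropped with remove_columns (the notebook's exact code may differ):

```python
from datasets import load_dataset

# Load all splits (train / validation / test) of the dataset.
ds = load_dataset("macwiatrak/bacbench-essential-genes-dna")

# Keep only the two columns used for modeling; everything else is metadata.
keep = {"dna_seq", "essential"}
ds = ds.remove_columns([c for c in ds["train"].column_names if c not in keep])
```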

Preprocessing

Key preprocessing steps:

  • The essential label is encoded as an integer (0 = non-essential, 1 = essential).
  • Each DNA sequence is mapped to a numeric encoding, with nucleotide characters (including ambiguity codes such as W) mapped to integer codes.
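
A minimal sketch of these steps, assuming a simple character-to-integer mapping; only W → 10 is stated in this README (under the limitations below), and the remaining values are illustrative placeholders:

```python
# Illustrative nucleotide-to-integer mapping; only W -> 10 comes from this README,
# the other values are placeholders and may differ from the notebook.
NUC_MAP = {"A": 1, "C": 2, "G": 3, "T": 4, "W": 10}

def encode_sequence(seq: str) -> list:
    """Map each nucleotide character to its integer code (unknown characters -> 0)."""
    return [NUC_MAP.get(base, 0) for base in seq.upper()]

def encode_label(label) -> int:
    """Encode essentiality as 0 (non-essential) / 1 (essential)."""
    return int(label)
```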

Feature Extraction

The feature representation is based on non-overlapping 4-mers:

  • Each sequence is split into consecutive, non-overlapping 4-mers (step size 4).
  • The feature vector for a sequence is the count of each possible 4-mer.

The resulting dense feature matrix is then converted to a SciPy CSR sparse matrix for memory efficiency.
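
A sketch of the 4-mer counting and sparse conversion, assuming counting is done directly on the raw DNA strings over the A/C/G/T alphabet (the notebook may operate on the integer-encoded sequences instead):

```python
from collections import Counter
from itertools import product

import numpy as np
from scipy.sparse import csr_matrix

K = 4
# All 256 possible 4-mers over the A/C/G/T alphabet, mapped to column indices.
VOCAB = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}

def kmer_counts(seq: str) -> np.ndarray:
    """Count non-overlapping 4-mers (step size 4); 4-mers outside the vocabulary are ignored."""
    counts = Counter(seq[i:i + K] for i in range(0, len(seq) - K + 1, K))
    vec = np.zeros(len(VOCAB), dtype=np.float32)
    for kmer, n in counts.items():
        if kmer in VOCAB:
            vec[VOCAB[kmer]] = n
    return vec

def featurize(sequences) -> csr_matrix:
    """Stack per-sequence count vectors and convert the dense matrix to CSR."""
    dense = np.stack([kmer_counts(s) for s in sequences])
    return csr_matrix(dense)
```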

Model

The classification model is a Logistic Regression classifier from sklearn.linear_model.

Training is performed on the 4-mer count features of the train split.
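
A training sketch, reusing the featurize and encode_label helpers sketched above and assuming default LogisticRegression hyperparameters apart from a raised iteration limit (the notebook's exact settings are not listed in this README):

```python
from sklearn.linear_model import LogisticRegression

X_train = featurize(ds["train"]["dna_seq"])                    # sparse 4-mer counts
y_train = [encode_label(l) for l in ds["train"]["essential"]]  # 0/1 labels

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
```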

Evaluation

Model performance is evaluated on both the validation and test splits using accuracy and F1 score.

The notebook prints these two metrics for each split.
These metrics provide an initial benchmark for this simple 4-mer + Logistic Regression approach.
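
A sketch of the evaluation loop using scikit-learn's accuracy and F1 metrics, with split names following the train/validation/test convention stated above:

```python
from sklearn.metrics import accuracy_score, f1_score

for split in ("validation", "test"):
    X = featurize(ds[split]["dna_seq"])
    y = [encode_label(l) for l in ds[split]["essential"]]
    preds = clf.predict(X)
    print(f"{split}: accuracy={accuracy_score(y, preds):.2f}, F1={f1_score(y, preds):.2f}")
```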

Requirements

Main Python dependencies:

  • pandas
  • numpy
  • scipy
  • datasets
  • scikit-learn

Example installation (if running locally): pip install pandas numpy scipy datasets scikit-learn

How to Run

  1. Open the notebook in Google Colab or your preferred environment.
  2. Ensure all required packages are installed.
  3. Run the cells in order:
    • Dataset loading and column filtering
    • Label encoding
    • DNA mapping and sequence encoding
    • 4-mer feature extraction
    • Model training
    • Evaluation on validation and test splits

Limitations and Possible Extensions

  1. Class Imbalance
    • Essential genes (1) are much rarer than non-essential genes (0).
    • Logistic Regression tends to predict the majority class, lowering the F1 score on validation (a class-weighting fix is sketched after this list).
  2. Simple Features
    • Using non-overlapping 4-mer counts loses many sequence patterns.
    • Linear combinations of k-mer counts may not capture complex dependencies between nucleotides.
  3. Non-Overlapping k-mers
    • Step size of 4 skips many overlapping patterns in the DNA sequence.
    • Important motifs or codon patterns might be missed.
  4. Normalization
    • Raw 4-mer counts vary with sequence length.
    • Longer sequences dominate the feature vectors, potentially biasing the classifier.
  5. Linear Model Limitations
    • Logistic Regression is a linear classifier.
    • Cannot capture non-linear interactions between k-mers that may be biologically relevant.
  6. Potential Data Leakage
    • Some sequences in train/test splits may be very similar or overlapping.
    • This can inflate test accuracy artificially, as seen in the high test F1 compared to validation.
  7. Limited Biological Context
    • Only nucleotide sequences are considered.
    • Other biological features (gene location, GC content, protein info) are ignored, which may be predictive of essentiality.
  8. Sparse Signal
    • Many 4-mer combinations may never appear, making feature vectors sparse.
    • Sparse linear models may struggle to generalize with limited data for certain patterns.
  9. Mapping
    • I did not take into account whether W, which is mapped to 10, will be treated as the single value 10 or as the separate digits 1 and 0, which would essentially derail the classification.
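
A sketch (assumptions, not the notebook's code) of how items 1, 3, and 4 could be addressed: overlapping 4-mers with step size 1, length-normalized frequencies, and class weighting in the Logistic Regression:

```python
from collections import Counter
from itertools import product

import numpy as np
from sklearn.linear_model import LogisticRegression

K = 4
VOCAB = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}

def overlapping_kmer_freqs(seq: str) -> np.ndarray:
    """Overlapping 4-mers (step size 1), normalized by the total 4-mer count (items 3 and 4)."""
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    vec = np.zeros(len(VOCAB), dtype=np.float32)
    for kmer, n in counts.items():
        if kmer in VOCAB:
            vec[VOCAB[kmer]] = n
    total = vec.sum()
    return vec / total if total > 0 else vec

# Class weighting counteracts the rarity of essential genes (item 1).
clf_balanced = LogisticRegression(max_iter=1000, class_weight="balanced")
```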

Model Evaluation

The baseline Logistic Regression classifier was evaluated on the validation and test splits using accuracy and F1 score:

Split        Accuracy   F1 Score
Validation   0.45       0.25
Test         0.90       0.80

⚠️ Note: The test F1 score is much higher than the validation F1 score; as discussed under Potential Data Leakage above, similar or overlapping sequences between splits may be inflating the test metrics.

Credits