Essential Gene Classification from DNA Sequences
This project implements a baseline machine learning pipeline to classify bacterial genes as essential or non-essential using DNA sequence information from the macwiatrak/bacbench-essential-genes-dna dataset (Hugging Face Datasets).
Project Overview
The notebook:
- Loads the BacBench essential genes dataset (train/validation/test splits).
- Cleans and simplifies the dataset by removing unused metadata columns.
- Encodes DNA sequences into integer representations using a custom nucleotide mapping.
- Extracts non-overlapping 4-mer (length-4 subsequence) count features.
- Trains a Logistic Regression classifier on the resulting feature vectors.
- Evaluates model performance using accuracy and F1 score on validation and test splits.
This serves as a simple, fast baseline for essential-gene prediction from raw DNA sequences.
Dataset
The project uses the macwiatrak/bacbench-essential-genes-dna dataset loaded via datasets.load_dataset.
Each split (train, validation, test) originally contains, among others, the following fields:
dna_seq: DNA sequence of the gene.essential: Label indicating whether a gene is essential ("Yes"or"No").- Several metadata columns (e.g.,
genome_name,start,end,protein_id,strand,product,__index_level_0__).
In this notebook, the unnecessary metadata columns are dropped, and only dna_seq and essential are retained for modeling.
Preprocessing
Key preprocessing steps:
- Label encoding
Theessentialfield is converted from string to integer:"Yes"→1"No"→0
- DNA character mapping
Each base indna_seqis mapped to an integer to prioritize efficiency:A → 0,T → 1,C → 2,G → 3- Ambiguous bases:
N → 4,K → 5,R → 6,S → 7,Y → 8,M → 9,W → 10
- Sequence encoding
A helper function converts each DNA string into a list of integers using the mapping above, discarding characters not present in the map.
Feature Extraction
The feature representation is based on non-overlapping 4-mers:
- The number of possible symbols is
NUM_BASES = 11. - The total number of distinct 4-mers is
NUM_4MERS = 11^4 = 14641. - For each encoded sequence, the notebook:
- Iterates with step size
STEP = 4to form non-overlapping 4-mers. - Maps each 4-mer to a unique integer index using positional encoding: [ \text{kmer_int} = b_0 \cdot 11^3 + b_1 \cdot 11^2 + b_2 \cdot 11 + b_3 ]
- Increments the corresponding position in a length-14641 count vector.
- Iterates with step size
The resulting dense feature matrix is then converted to a SciPy CSR sparse matrix for memory efficiency.
Model
The classification model is a Logistic Regression from sklearn.linear_model with:
solver='saga'max_iter=2000n_jobs=-1(parallel training where possible)
Training is performed on the 4-mer count features of the train split.
Evaluation
Model performance is evaluated on both validation and test splits using:
- Accuracy (
sklearn.metrics.accuracy_score) - F1 Score (
sklearn.metrics.f1_score)
The notebook prints:
- Validation Accuracy
- Validation F1 Score
- Test Accuracy
- Test F1 Score
These metrics provide an initial benchmark for this simple 4-mer + Logistic Regression approach.
Requirements
Main Python dependencies:
pandasnumpyscipydatasets(Hugging Face Datasets)scikit-learn
Example installation (if running locally):
pip install pandas numpy scipy datasets scikit-learn
How to Run
- Open the notebook in Google Colab or your preferred environment.
- Ensure all required packages are installed.
- Run the cells in order:
- Dataset loading and column filtering
- Label encoding
- DNA mapping and sequence encoding
- 4-mer feature extraction
- Model training
- Evaluation on validation and test splits
Possible Extensions
- Use overlapping k-mers or different k-mer sizes to capture more sequence context.
- Try more expressive models (e.g., tree-based methods, neural networks).
- Explore alternative encodings (e.g., one-hot, embeddings, or biologically informed encodings).
- Add cross-validation and hyperparameter tuning for more robust performance estimates.
Issues with Current Gene Classifier
- Class Imbalance
- Essential genes (
1) are much rarer than non-essential genes (0). - Logistic Regression tends to predict the majority class, lowering F1 score on validation.
- Essential genes (
- Simple Features
- Using non-overlapping 4-mer counts loses many sequence patterns.
- Linear combinations of k-mer counts may not capture complex dependencies between nucleotides.
- Non-Overlapping k-mers
- Step size of 4 skips many overlapping patterns in the DNA sequence.
- Important motifs or codon patterns might be missed.
- Normalization
- Raw 4-mer counts vary with sequence length.
- Longer sequences dominate the feature vectors, potentially biasing the classifier.
- Linear Model Limitations
- Logistic Regression is a linear classifier.
- Cannot capture non-linear interactions between k-mers that may be biologically relevant.
- Potential Data Leakage
- Some sequences in train/test splits may be very similar or overlapping.
- This can inflate test accuracy artificially, as seen in the high test F1 compared to validation.
- Limited Biological Context
- Only nucleotide sequences are considered.
- Other biological features (gene location, GC content, protein info) are ignored, which may be predictive of essentiality.
- Sparse Signal
- Many 4-mer combinations may never appear, making feature vectors sparse.
- Sparse linear models may struggle to generalize with limited data for certain patterns.
- Mapping
- I did not take into account whether W which is mapped to 10 will be treated as 10 or 1 and 0 which would essentialy derail the classification
Model Evaluation
- I did not take into account whether W which is mapped to 10 will be treated as 10 or 1 and 0 which would essentialy derail the classification
The baseline Logistic Regression classifier was evaluated on the validation and test splits using accuracy and F1 score:
| Split | Accuracy | F1 Score |
|---|---|---|
| Validation | 0.45 | 0.25 |
| Test | 0.90 | 0.80 |
⚠️ Note:
- Validation F1 is low due to class imbalance and simple linear model.
- The high test metrics may be artificially inflated if some sequences are very similar across splits.
- This baseline serves as a starting point for further improvements.
Credits
- Dataset: BacBench Essential Genes DNA Dataset by Mac Wiatrak et al., hosted on HuggingFace.
- Libraries & Tools:
- HuggingFace
datasetsfor data loading and preprocessing - NumPy for numerical operations
- SciPy for scientific computing
- scikit-learn for machine learning models and evaluation metrics
- HuggingFace
- Inspired by: Standard bioinformatics workflows for DNA k-mer feature extraction and baseline classification.
-Workflow & Model Implementation: Done by Sharat DoddihalNote
This was my first attempt at creating a Ml model by myself without too much use from AI.AI has been used here but only for helping with the debugging process. Overall I am happy with how this turned as this was a great learning experience.There are many fundamental errors that mess with the accuracy.