The present work is based on the paper by Alexander N. Gorban and Andrei Y. Zinovyev. The link to the paper.

This work is essentially a representation that how data visualisation can help in the genomic sequence analysis in affirming hypothesis or other phenomenon

We will analyse a small fragment of the DNA

The DNA nucleotide contains 4 bases A, T, G, C.
One distinctive message in a genomic sequence is a piece of text, called a gene.
It was one of many great discoveries of the twentieth century that biological information is encoded in genes by means of triplets of letters, called codons in the biological literature.

Problem Statement : Prove that the sequence of letters in the genomic sequence is not random and that the information in the text is encoded by non-overlapping triplets i.e. codons.

Data : Genomic Sequence of Covid-19

Imports and File reads

_Download the DNA sequences from NCBI site here_

import numpy as np
import itertools
import preprocessing
from sklearn import preprocessing
from sklearn.decomposition import PCA
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

path = "../input/gnomes/Gnomes"

def loadGenome(path):
    genome=""
    with open(path, "r") as f:
        _ = f.readline() # ignore the header
        l = f.readline()
        while l:
            genome += l[:-1] # ignore new line
            l = f.readline()
    genome_size = len(genome)
    return genome, genome_size

genome, genome_size = loadGenome(path+"/Covid.fasta")

The number of characters in the DNA fragment

genome_size

29499

Four bases viz. A, T, G, C

A - Adenine, T- Thiamine, G - Guanine, C - Cytosine

letters=list(set(genome))

letters

['C', 'G', 'A', 'T']

Define parameters

Split the genome into block_size length blocks, here 300

As there are only four letters, there are four possible words of length 1 (singlets)
16 = 4^2 possible words of length 2 (duplets), 64 = 4^3 possible words
of length 3 (triplets) and 256 = 4^4 possible words of length 4 (quadruplets).

block_size = 350
word_size = [1,2,3,4]
num_blocks = genome_size//block_size

num_blocks

84

Attempting to break the genome in blocks

blocks = []
start_idx = 0
for idx in range(num_blocks):
    end_idx = start_idx + block_size
    blocks.append(genome[start_idx:end_idx])
    start_idx = end_idx
    if (idx + 1 == num_blocks):
        blocks.append(genome[end_idx:genome_size])

Word Frequency

Number of possible words for each word size

def getFeatures(genome, num_blocks, word_size):
    features={x : np.zeros((num_blocks, 4**x)) for x in word_size}
    for ws in word_size:
        lookUp = { y : x for x,y in enumerate([''.join(l) for l in itertools.product(*[letters]*ws)]) }
        for b in range(num_blocks):
            block = genome[b*block_size:(b+1)*block_size]
            for i in range(block_size//ws):
                word = block[i*ws:(i+1)*ws]
                features[ws][b,lookUp[word]] += 1
    return features

features = getFeatures(genome, num_blocks, word_size)

Standardize the data using Standard Scaler

def standardize(features, word_size):
    for ws in word_size:
        std_scale = preprocessing.StandardScaler().fit(features[ws])
        features[ws] = std_scale.transform(features[ws])
    return features

features = standardize(features, word_size)

Principal Component Analysis

def runPCA(features, word_size, n_components=2):
    featuresPCA = {}
    for ws in word_size:
        pca = PCA(n_components=n_components).fit(features[ws])
        featuresPCA[ws] = pca.transform(features[ws])
    return featuresPCA

featuresPCA = runPCA(features, word_size)

from matplotlib.pyplot import figure
fig, axes = plt.subplots(2,2, figsize=(15, 15))
axes[0,0].scatter(featuresPCA[1].T[0], featuresPCA[1].T[1], s=5)
axes[0,1].scatter(featuresPCA[2].T[0], featuresPCA[2].T[1], s=5)
axes[1,0].scatter(featuresPCA[3].T[0], featuresPCA[3].T[1], s=5)
axes[1,1].scatter(featuresPCA[4].T[0], featuresPCA[4].T[1], s=5)

<matplotlib.collections.PathCollection at 0x7fbad8b88dd0>

PCA plots of word frequencies of different length. Third one shows the most structured distribution. The structure can be interpreted as the existence of a nonoverlapping triplet code

kmeans = KMeans(n_clusters=3, random_state=0).fit(featuresPCA[3])

kmeans.labels_

array([2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2,
       0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 0, 1, 2, 0, 1, 2,
       0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 0, 1, 2, 0, 1,
       2, 0, 1, 2, 0, 1, 1, 2, 0, 0, 1, 0, 1, 1, 1, 2, 0, 1], dtype=int32)

plt.figure(figsize=(10,10))
plt.scatter(featuresPCA[3].T[0], featuresPCA[3].T[1], s = 5, c = kmeans.labels_ )

<matplotlib.collections.PathCollection at 0x7fbad87d38d0>

Genomic text contains information that is encoded by non-overlapping triplets, because the plot corresponding to the triplets is evidently highly structured as opposed to the pictures of singlets, duplets and quadruplets.

TSNE Visuals

PCA loses a lot of information in compression, TSNE might be a better plot

import sklearn.manifold
tsne = sklearn.manifold.TSNE(n_components=2, random_state=0)

def runTSNE(features, word_size):
    featuresTSNE={}
    for ws in word_size:
        featuresTSNE[ws] = tsne.fit_transform(features[ws])
    return featuresTSNE

featuresTSNE = runTSNE(features, word_size)

/opt/conda/lib/python3.7/site-packages/sklearn/manifold/_t_sne.py:783: FutureWarning: The default initialization in TSNE will change from 'random' to 'pca' in 1.2.
  FutureWarning,
/opt/conda/lib/python3.7/site-packages/sklearn/manifold/_t_sne.py:793: FutureWarning: The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.
  FutureWarning,
/opt/conda/lib/python3.7/site-packages/sklearn/manifold/_t_sne.py:783: FutureWarning: The default initialization in TSNE will change from 'random' to 'pca' in 1.2.
  FutureWarning,
/opt/conda/lib/python3.7/site-packages/sklearn/manifold/_t_sne.py:793: FutureWarning: The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.
  FutureWarning,
/opt/conda/lib/python3.7/site-packages/sklearn/manifold/_t_sne.py:783: FutureWarning: The default initialization in TSNE will change from 'random' to 'pca' in 1.2.
  FutureWarning,
/opt/conda/lib/python3.7/site-packages/sklearn/manifold/_t_sne.py:793: FutureWarning: The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.
  FutureWarning,
/opt/conda/lib/python3.7/site-packages/sklearn/manifold/_t_sne.py:783: FutureWarning: The default initialization in TSNE will change from 'random' to 'pca' in 1.2.
  FutureWarning,
/opt/conda/lib/python3.7/site-packages/sklearn/manifold/_t_sne.py:793: FutureWarning: The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.
  FutureWarning,

from matplotlib.pyplot import figure
fig, axes = plt.subplots(2,2, figsize=(15, 15))
axes[0,0].scatter(featuresTSNE[1].T[0], featuresTSNE[1].T[1], s=5)
axes[0,1].scatter(featuresTSNE[2].T[0], featuresTSNE[2].T[1], s=5)
axes[1,0].scatter(featuresTSNE[3].T[0], featuresTSNE[3].T[1], s=5)
axes[1,1].scatter(featuresTSNE[4].T[0], featuresTSNE[4].T[1], s=5)

<matplotlib.collections.PathCollection at 0x7fbad8978fd0>

kmeans_tsne = KMeans(n_clusters=3, random_state=0).fit(featuresTSNE[3])

plt.figure(figsize=(10,10))
plt.scatter(featuresTSNE[3].T[0], featuresTSNE[3].T[1], s=2, c=kmeans_tsne.labels_)

<matplotlib.collections.PathCollection at 0x7fbad8978f10>

TSNE produced a similar plot as PCA

The present work is based on the paper by Alexander N. Gorban and Andrei Y. Zinovyev. The link to the paper.

This work is essentially a representation that how data visualisation can help in the genomic sequence analysis in affirming hypothesis or other phenomenon

We will analyse a small fragment of the DNA

The DNA nucleotide contains 4 bases A, T, G, C.

One distinctive message in a genomic sequence is a piece of text, called a gene.

It was one of many great discoveries of the twentieth century that biological information is encoded in genes by means of triplets of letters, called codons in the biological literature.

Problem Statement : Prove that the sequence of letters in the genomic sequence is not random and that the information in the text is encoded by non-overlapping triplets i.e. codons.

Data : Genomic Sequence of Covid-19

Imports and File reads

_Download the DNA sequences from NCBI site here_

The number of characters in the DNA fragment

Four bases viz. A, T, G, C

A - Adenine, T- Thiamine, G - Guanine, C - Cytosine

Define parameters

Split the genome into block_size length blocks, here 300

As there are only four letters, there are four possible words of length 1 (singlets)

16 = 4^2 possible words of length 2 (duplets), 64 = 4^3 possible words

of length 3 (triplets) and 256 = 4^4 possible words of length 4 (quadruplets).

Attempting to break the genome in blocks

Word Frequency

Number of possible words for each word size

Standardize the data using Standard Scaler

Principal Component Analysis

PCA plots of word frequencies of different length. Third one shows the most structured distribution. The structure can be interpreted as the existence of a nonoverlapping triplet code

Genomic text contains information that is encoded by non-overlapping triplets, because the plot corresponding to the triplets is evidently highly structured as opposed to the pictures of singlets, duplets and quadruplets.

TSNE Visuals

PCA loses a lot of information in compression, TSNE might be a better plot

TSNE produced a similar plot as PCA

End of Notebook