THANK YOU FOR HELPING ME

CpG islands or CG islands are genomic regions that contain a high frequency of C and G nucleotides. They are often found in gene promoters and are important for regulation of gene expression.

1. Construct a model (i.e. draw a picture) representing an HMM for finding CpG islands across the genome which includes hidden states, emission states, transition probabilities, and emission probabilities. While the exact transition and emission probabilities are not known (you can make up numbers), the probabilities in your model should generally reflect the biology of CpG islands. Explain the reasoning behind your choice of probabilities. (Hint: When determining the states in your HMM, think about what it means to be inside or outside a CpG island. This can be represented by a very simple HMM. For the transition probabilities between hidden states, do additional research and find out what percentage of the genome is estimated to be in a CpG island. This percentage or frequency should at some level be reflected in your transition probabilities.)

2. In HMMs, the transition probabilities leaving a particular hidden state should always add up to 1 (i.e. the probabilities of the arrows going out from a state should add up to 1). Is this the same for the probabilities entering a particular state? In other words, do the probabilities of the arrows pointing towards a state always add up to 1? Can you explain this?

Solution

The frequency of the nucleotide pair C-followed-by-G (CpG) is higher in a CpG island than elsewhere. Our HMM should be able to distinguish between “being within a CpG island” and “not being within a CpG island”. The observations of the HMM are the observed DNA sequence (i.e., a string of nucleotides A, T, C, G). Using just two hidden states would be too simplistic, because then we could not specify probabilities for one base following another. Instead, we would like the transition probabilities to reflect how often one nucleotide follows another (separately for inside and outside CpG islands). This means defining the emission alphabet A = {A, T, C, G} and the state space S = {A, T, C, G, a, t, c, g}, where a capital letter corresponds to being inside a CpG island and a lower-case letter corresponds to being outside a CpG island. The states emit their corresponding letters. That is, state “A” and state “a” both always emit the nucleotide “A”, etc. That makes

the emission probabilities (rows: hidden states; columns: emitted nucleotides):

B =        A  T  C  G
      A    1  0  0  0
      T    0  1  0  0
      C    0  0  1  0
      G    0  0  0  1
      a    1  0  0  0
      t    0  1  0  0
      c    0  0  1  0
      g    0  0  0  1
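As a minimal sketch, the emission matrix described above can be built in a few lines of Python. The names STATES, ALPHABET and emission_matrix are illustrative choices, not part of the original solution; the only assumption is that each state emits the upper-case version of its own letter with probability 1.

```python
# Hidden states: capital = inside a CpG island, lower-case = outside.
STATES = ["A", "T", "C", "G", "a", "t", "c", "g"]
# Emission alphabet: the observed nucleotides.
ALPHABET = ["A", "T", "C", "G"]


def emission_matrix(states=STATES, alphabet=ALPHABET):
    """Return B as a list of rows, B[i][j] = P(emit alphabet[j] | states[i]).

    Each state deterministically emits its own nucleotide, so every row
    contains a single 1 and three 0s.
    """
    return [[1.0 if s.upper() == x else 0.0 for x in alphabet] for s in states]


B = emission_matrix()
for state, row in zip(STATES, B):
    print(state, row)
```

Because the emissions are deterministic, all of the uncertainty in this model sits in the transition probabilities between the eight states.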

The transition probability matrix is 8 × 8 and has the following general structure:

P = [ (1 - r) P1     (r/4) J     ]
    [ (q/4) J        (1 - q) P2  ]

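where the matrix J is the 4 × 4 matrix of all-1-entries. P1 and P2 are the transition probability matrices inside and outside of CpG islands, respectively. With this structure, r is the total probability of leaving a CpG island at a given position and q is the total probability of entering one, so every row of P sums to 1.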
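The block structure can be checked numerically with a short Python sketch. All numbers here (the entries of P1 and P2, and the values of r and q) are made up purely for illustration, as the question allows; they are chosen only so that C-to-G is common inside an island and rare outside, and so that entering an island is much less likely than leaving one. The helper name block_transition_matrix is hypothetical.

```python
# Nucleotide order within each 4x4 block: A, T, C, G (rows = from, columns = to).
# All probabilities below are invented for illustration only.
P1 = [  # inside a CpG island: C and G are frequent, C -> G is common
    [0.20, 0.10, 0.30, 0.40],
    [0.10, 0.20, 0.35, 0.35],
    [0.15, 0.15, 0.35, 0.35],
    [0.15, 0.10, 0.35, 0.40],
]
P2 = [  # outside a CpG island: C -> G is rare
    [0.30, 0.30, 0.20, 0.20],
    [0.25, 0.30, 0.20, 0.25],
    [0.35, 0.35, 0.25, 0.05],
    [0.30, 0.25, 0.25, 0.20],
]
r = 0.01    # assumed probability of leaving an island at each step
q = 0.0001  # assumed probability of entering an island at each step


def block_transition_matrix(P1, P2, r, q):
    """Assemble [[(1 - r) P1, (r/4) J], [(q/4) J, (1 - q) P2]] as a list of rows."""
    n = len(P1)
    top = [[(1 - r) * p for p in P1[i]] + [r / n] * n for i in range(n)]
    bottom = [[q / n] * n + [(1 - q) * p for p in P2[i]] for i in range(n)]
    return top + bottom


P = block_transition_matrix(P1, P2, r, q)

# Outgoing probabilities (rows) versus incoming probabilities (columns).
row_sums = [sum(row) for row in P]
col_sums = [sum(P[i][j] for i in range(len(P))) for j in range(len(P))]
print("row sums:", [round(s, 6) for s in row_sums])
print("column sums:", [round(s, 6) for s in col_sums])
```

With these assumed values every row sums to 1 (up to floating-point rounding), while the column sums do not: the arrows leaving a state form a probability distribution over the next state, whereas the arrows entering a state come from different distributions and carry no such constraint. Under this block structure the long-run fraction of positions inside an island is q / (r + q), roughly 1% for the made-up r and q above, which is one way to tie the transition probabilities to an estimated genome-wide island fraction.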
