The Discussion
This project is based on a particular exchange with 'topmind' on
the newsgroup talk.origins.
Zachriel: If there was a bit-map of the Mona Lisa within the human genome, there is a very high probability it would have been noticed because it would stick out like a statistical sore thumb.
topmind: I am very skeptical of that. Can you provide a demonstration?
Zachriel: Statistical methods used to analyze the human genome are more than capable of detecting intelligent patterns, including a bitmap of the Mona Lisa.
topmind: Prove it.
topmind: I asked which prior searches would have found Mona Lisa bitmaps if they existed.
topmind: The algorithms to find such would have to be explicitly tuned.
topmind: the "test" suggested was bitmapped images.
topmind: Put your Mona where your mouth is!
And so on.
The Hypothesis
Because an image is generally distinguished by regions of
self-similarity, an embedded image should produce a localized statistical anomaly in a random sequence, and quite possibly in genomic data as well.
Finding Mona aims to demonstrate that a bitmap image has
a very distinctive statistical footprint, and that nothing else within
the examined genome has this footprint. The statistical test it uses is
exceedingly simple, yet quite powerful.
The Data
The bitmap of the Mona Lisa was downloaded from the Louvre, then reduced to just 10x14 pixels
to provide the smallest discernible image, Tiny Mona.
The E. coli genome was downloaded from the E. coli Genome Project and is
composed of about 4 million nucleotide bases.
The Algorithm
Finding Mona divides a sequence into a number of equal-length segments, determines the arithmetic mean of each segment, and identifies the segment whose mean differs most from the global mean. The sequence can be a genome or random data. As a test, we can choose to insert Tiny Mona at a random position in the sequence.
Finding Mona consistently identifies Tiny Mona as an anomaly, demonstrating that a conventional bitmap of the Mona Lisa has a very distinctive statistical footprint, and that nothing else within the examined genome has this footprint.
Finding Mona uses an exceedingly simple statistical test, yet is quite powerful. This result has been verified for a variety of different values of the Base/Pixel and Base Assignment parameters.
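The segment test described above can be sketched in a few lines of Python. This is only an illustration, not the spreadsheet's actual code: the function name `most_anomalous_segment` is hypothetical, and it assumes the global mean is taken over exactly the values covered by the segments.

```python
from statistics import mean

def most_anomalous_segment(values, num_segments, segment_length):
    """Return (index, segment_mean, global_mean) for the segment whose
    mean differs most from the global mean.

    Assumption: the global mean is computed over the segmented values only.
    """
    needed = num_segments * segment_length
    if needed > len(values):
        raise ValueError("sequence too short for the requested segmentation")
    global_mean = mean(values[:needed])
    # One arithmetic mean per equal-length, non-overlapping segment.
    segment_means = [
        mean(values[i * segment_length:(i + 1) * segment_length])
        for i in range(num_segments)
    ]
    # Pick the segment with the largest absolute deviation from the global mean.
    index = max(range(num_segments),
                key=lambda i: abs(segment_means[i] - global_mean))
    return index, segment_means[index], global_mean

# Toy data: a flat background with one self-similar, darker region planted in it.
values = [100] * 1000
values[300:350] = [10] * 50
print(most_anomalous_segment(values, num_segments=20, segment_length=50))
# -> (6, 10, 95.5)
```

With real data the background is noisy rather than flat, so the planted region stands out only if its mean lies well outside the spread of the ordinary segment means.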
ctrl-g to compute (go)
The Parameters
The algorithm runs rather slowly, so the computation is divided into three parts.
You can limit the computation by leaving unchanged the parameters that
trigger the slower parts.
HASH: Takes the genome and converts it into numbers. Depends on the Base/Pixel and Base Assignment settings. The numerical encoding of the genome is done by treating each nucleotide base as a quaternary digit (base 4). You can set the number of digits per pixel from 2 to 4, but it defaults to three, the number of bases in a natural codon. You can also set the assignment of the bases to each quaternary numeral. Can take several minutes.
BUILD: Reads the Hash into arrays. Depends on the Sequence size parameter, as well as the Mona and Genome flags.
CALC: Sums the segments. Depends on the number of Segments and the Length of the segments.
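As a rough illustration of the Build step, the sketch below fills a sequence either from already-encoded data or from random numbers, and optionally hides an image at a random offset. Everything here is an assumption made for the example: the name `build_sequence`, the use of uniform random bytes from 0 to 255, and the choice to overwrite existing values rather than lengthen the sequence.

```python
import random

def build_sequence(size, image=None, source=None, rng=None):
    """Return (sequence, image_start).

    source: already-encoded pixel values (e.g. a hashed genome), or None
            to fill with random bytes (assumed uniform over 0..255).
    image:  pixel values to hide at a random offset, or None.
    """
    rng = rng or random.Random()
    if source is not None:
        if len(source) < size:
            raise ValueError("source is shorter than the requested size")
        sequence = list(source[:size])
    else:
        sequence = [rng.randrange(256) for _ in range(size)]
    image_start = None
    if image is not None:
        if len(image) > size:
            raise ValueError("image does not fit in the sequence")
        # Assumption: the image overwrites existing values, keeping the length fixed.
        image_start = rng.randrange(size - len(image) + 1)
        sequence[image_start:image_start + len(image)] = image
    return sequence, image_start

# A 10x14 image is 140 pixels; a constant stand-in is used here, not real image data.
sequence, start = build_sequence(4000, image=[7] * 140, rng=random.Random(1))
print(len(sequence), sequence[start:start + 140] == [7] * 140)
```

If fewer than four bases per pixel are used, the random fill would presumably need to match the value range produced by the encoding; the sketch ignores that detail.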
BASES
a, c, g, t : Each base is a quaternary
digit (numerical base-4), so takes two binary bits. Each base can be assigned a value from 0 to 3. Be sure to assign each base a unique value.
BASE/PIXEL: The number of bases per color-pixel, from 2 to 4. Each color-pixel is a single byte in memory. If Base/Pixel is set to fewer than
four, the quaternary digits are placed in the higher-order binary bits of each byte.
SEQUENCE: The size of the Sequence to be considered.
MONA: If true, then Tiny Mona will be hidden randomly within the sequence.
GENOME: If true, then the sequence will be the E. coli genome. If false, then the sequence will be filled with the appropriate random numbers.
SEGMENTS: The number of Segments (max 32000).
LENGTH: The Length of each segment.
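The Bases and Base/Pixel settings can be illustrated with a short Python sketch of the encoding. It is not the original implementation: the function name `encode_bases` and the default assignment a=0, c=1, g=2, t=3 are hypothetical, and the sketch assumes that the first base of each group is the most significant digit, that groups do not overlap, and that leftover bases at the end are dropped.

```python
def encode_bases(sequence, assignment=None, bases_per_pixel=3):
    """Convert a nucleotide string into a list of one-byte pixel values."""
    if assignment is None:
        # Hypothetical default; the source does not state the default assignment.
        assignment = {"a": 0, "c": 1, "g": 2, "t": 3}
    if sorted(assignment.values()) != [0, 1, 2, 3]:
        raise ValueError("each base needs a unique value from 0 to 3")
    if not 2 <= bases_per_pixel <= 4:
        raise ValueError("bases_per_pixel must be 2, 3, or 4")
    pixels = []
    usable = len(sequence) - len(sequence) % bases_per_pixel
    for start in range(0, usable, bases_per_pixel):
        value = 0
        for base in sequence[start:start + bases_per_pixel].lower():
            value = value * 4 + assignment[base]  # append one quaternary digit
        # With fewer than four bases, shift the digits into the high-order bits.
        value <<= 2 * (4 - bases_per_pixel)
        pixels.append(value)
    return pixels

print(encode_bases("acg"))                      # [24]
print(encode_bases("acgt", bases_per_pixel=2))  # [16, 176]
print(encode_bases("tttt", bases_per_pixel=4))  # [255]
```

For "acg" the quaternary digits 0, 1, 2 give 0*16 + 1*4 + 2 = 6, which shifted left by two bits becomes 24.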
REMEMBER:
If you change the light-yellow settings, it requires a simple Calc.
If you change the bright-yellow settings, it requires a Build.
But if you change the orange settings, beware, it requires a Hash.
NOTE: When Finding Mona first runs, it has to Build the arrays, but already has a default Hash table. After that, you can Calc with the Segment length and number without having to Build or Hash.
The Results
The algorithm determines the segment that varies the
most from the global mean. It then checks whether that segment overlaps the
randomly placed Tiny Mona. If it does,
it notes it as a match.
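The match decision amounts to an interval-overlap check. The sketch below is one plausible way to write it, assuming both the segment and the image are half-open ranges of positions in the sequence; the function name and parameters are hypothetical.

```python
def segment_overlaps_image(segment_index, segment_length, image_start, image_length):
    """Return True if the given segment shares at least one position with the image."""
    seg_start = segment_index * segment_length
    seg_end = seg_start + segment_length      # exclusive end
    image_end = image_start + image_length    # exclusive end
    return seg_start < image_end and image_start < seg_end

# Segment 6 of length 50 covers positions 300..349.
print(segment_overlaps_image(6, 50, image_start=340, image_length=140))  # True
print(segment_overlaps_image(6, 50, image_start=350, image_length=140))  # False
```

Note that a partial overlap counts as a match under this reading, so the segment length need not equal the image length.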
The individual segment averages are listed in column G.
The segment averages in the vicinity of a successful match
are in columns H and I.
The Conclusion
There are an infinite number of ways of encoding an image. There are an infinite number of algorithms for detecting an image. There are an infinite number of possible false positives, depending on the algorithm. However, we can say with some certainty that standard statistical methods would
easily discover a conventional bitmap of the Mona Lisa, and that no such image exists within the genome examined.