36 Replies Latest reply on May 3, 2011 10:28 AM by philmodjunk

    Help! Stuck with patternsearch





      Hi all,

      I am trying to calculate something for which I need to count the number of doublets (like AA, AC, TT, etc.) in a string.

      For example ACGATTGC gives AC CG GA AT TT TG and GC

      My problem arises when a letter repeats more than twice in a row: TTT should be two occurrences of TT, but patternsearch will only see one.

      TTTT should be three occurrences, but patternsearch only sees two.


      Is there a variation of patternsearch which can do this?
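To make the problem concrete, here is a minimal sketch in Python rather than FileMaker. Python's built-in `str.count` also skips overlapping matches, so it reproduces the behaviour described above; the helper `count_overlapping` is a hypothetical name introduced only for this illustration.

```python
def count_overlapping(text: str, pair: str) -> int:
    """Count occurrences of `pair` in `text`, allowing matches to overlap."""
    width = len(pair)
    # Test every start position instead of jumping past each match.
    return sum(1 for i in range(len(text) - width + 1)
               if text[i:i + width] == pair)


# Non-overlapping count versus overlapping count for the two examples above.
print("TTT".count("TT"), count_overlapping("TTT", "TT"))    # 1 2
print("TTTT".count("TT"), count_overlapping("TTTT", "TT"))  # 2 3
```

The only difference is that the overlapping version advances one character at a time instead of skipping past the text it has just matched.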



        • 1. Re: Help! Stuck with patternsearch

          Hi Markus,

          If identical 2-character references should be counted as two then I see it as:

          ABCDEFTTT = 8


          Is this right?  If so then  Length ( text )  - 1 will work, I think.

          • 2. Re: Help! Stuck with patternsearch

            Thanks, but I need the number of each doublet, not the number of doublets, so 1x AA, 3x CC, 1x GC etc

            Alternatively, if I could turn ACGATTGC into AC CG GA AT TT TG GC, I could use the normal patternsearch, but for this I need some kind of for...next loop, and I haven't figured out how to do that.
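The transformation described above can be sketched as a short loop. This is Python, not FileMaker, and `split_into_doublets` is a hypothetical name used only for illustration.

```python
def split_into_doublets(sequence: str) -> str:
    """Return every neighbouring pair of letters, separated by spaces."""
    pairs = []
    for i in range(len(sequence) - 1):
        pairs.append(sequence[i:i + 2])  # letters i and i+1
    return " ".join(pairs)


spaced = split_into_doublets("ACGATTGC")
print(spaced)  # AC CG GA AT TT TG GC

# Once the pairs are separated, a plain non-overlapping count is enough.
print(split_into_doublets("TTT").count("TT"))  # 2
```

The spaces matter: they stop a count from matching across the boundary between two neighbouring pairs.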

            • 3. Re: Help! Stuck with patternsearch

              Do you need the doublet count for a specified pair, or a report that provides a breakdown of all existing pairs?

              • 4. Re: Help! Stuck with patternsearch

                the count for all doublets in a given string (which can be different in length  - usually 18 to 30 - and sequence)

                • 5. Re: Help! Stuck with patternsearch

                  We really need to understand the context and purpose here.  What does this string represent?

                  • 6. Re: Help! Stuck with patternsearch

                    Not sure how to help here so please ignore my input until you have reached an answer...but I can't stand it, I have to ask, just too curious.

                    What the heck would you use a calculation like that for? LOL

                    • 7. Re: Help! Stuck with patternsearch

                      From past posts and biology classes, I know these are nucleic acid pairs: Adenine, Thymine, Guanine and Cytosine.

                      This is something you may want to tackle with a script that loops through your text to produce a breakdown of each possible pairing for a given strand. It could also be done with a recursive custom function or a recursive calculation field, but the script may be a more convenient option for presenting the results.

                      Knowing just a bit more about what you need to do with the results of such an analysis might easily tilt us toward one or the other of these options here.
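As a rough idea of the recursive approach mentioned above, here is a sketch in Python, not FileMaker custom function syntax. The name `count_pair_recursive` is hypothetical. The function checks the first two letters, then calls itself on the sequence with its first letter removed.

```python
def count_pair_recursive(sequence: str, pair: str) -> int:
    """Count overlapping occurrences of `pair` by peeling off one letter per call."""
    if not pair:
        raise ValueError("pair must not be empty")
    if len(sequence) < len(pair):
        return 0  # too short to contain the pair: recursion stops here
    hit = 1 if sequence.startswith(pair) else 0
    return hit + count_pair_recursive(sequence[1:], pair)


print(count_pair_recursive("TTTT", "TT"))      # 3
print(count_pair_recursive("ACGATTGC", "GA"))  # 1
```

For sequences of 18 to 30 letters the recursion depth stays tiny, so depth limits should not be a concern in this sketch.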

                      • 8. Re: Help! Stuck with patternsearch

                        A recursive custom function could handle it as well.  But many times this type of calculation (that we rarely get requests for) indicates there are easier ways to handle it.

                        • 9. Re: Help! Stuck with patternsearch

                          We have a database of oligos (short pieces of DNA)

                          Certain properties (like length, molecular weight, melting temperature, and extinction coefficient) need to be calculated automatically from the sequence.

                          The best method is based on nearest-neighbour calculations and requires knowing how many doublets of each type there are, that is (in the simplified version), how many of each of

                          AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT 

                          Note that something like TTT contains TWO TT doublets (neighbour pairs).

                          Each doublet has a value associated with it (e.g. AA is 15400, AC is 12400 etc) and the sum over all neighboring pairs is required for the calculation - that's the easy part then, getting the number of neighbor pairs (for each doublet) seems to be the hard part.

                          P.S. This is being calculated on the fly as one enters the sequence as well as when a record opens as we have over 4000 records already in the database - scientists are notoriously lazy in entering values in a database that are not specifically useful to themselves.
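The per-doublet tally described in this post can be sketched in a few lines of Python (again, not FileMaker). `doublet_counts` is a hypothetical helper name, and the sketch assumes the sequence has already been cleaned up to contain only the letters A, C, G and T.

```python
from collections import Counter


def doublet_counts(sequence: str) -> Counter:
    """Tally every neighbouring pair of letters, counting overlapping pairs."""
    return Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))


counts = doublet_counts("ACGATTGC")
print(counts["TT"], counts["AA"])  # 1 0
print(doublet_counts("TTT")["TT"])  # 2

# The tallies always add up to Length - 1, as noted earlier in the thread.
print(sum(counts.values()))  # 7
```

Pairs that never occur simply report a count of zero, so all sixteen doublets can be looked up without special cases.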

                          • 10. Re: Help! Stuck with patternsearch

                            Ok, but what form do the results take here? In many ways, it seems you need something similar to a summary report that lists all possible pairs with total counts for each. I realize that report is not something one would "generate on the fly while entering the sequence", but I'm trying to visualize what you want to see for your results here.

                            If LaRetta or anyone can suggest a better way, I'm all ears, but I'm leaning towards a scripted method using the OnObjectKeystroke trigger, which would process each addition and deletion of a letter to generate (and delete) records in a related table where you have one record for each possible pairing. A calculation (possibly a recursive custom function or recursive calculation field) could then be evaluated from each related record to report a pair count for that specific pair record.

                            • 11. Re: Help! Stuck with patternsearch

                              "Each doublet has a value associated with it (e.g. AA is 15400, AC is 12400 etc)"

                              Where are these values stored?

                              • 12. Re: Help! Stuck with patternsearch

                                No. You are thinking too "database". This is a calculation triggered by the keystroke in the sequence field. It is not about making a report or deleting records - it just has to calculate a value and put it into the field "molar extinction coefficient" based on the string of letters entered in the field "sequence". And for that I need to figure out how many of each doublet there are in the sequence.

                                If you read over what I wrote then you will see that it is explained clearly and with examples. I even mention two possible ways to deal with it that I can see:

                                (a) count the doublets in some form of loop

                                (b) turn the sequence into a form that PatternCount can deal with (like turning ACGATTGC into AC CG GA AT TT TG GC as mentioned above), again needing a loop

                                This is something that is VERY easy to do in any programming language that I know (Basic, C, Python, Pascal) by using a for...next loop and a variable - FileMaker seems to struggle badly with something that should be simple.

                                • 13. Re: Help! Stuck with patternsearch

                                  As I said, I was using the "report" as an analogy to understand the format you needed for your result(s).

                                  I'd still use a table of related records as a way to better support the calculation that you want. It's not the only way to do this by far, but it can be one way to keep from counting the same base pair more than once.

                                  The "Delete" records reference was an attempt to keep in mind that editing the sequence to remove one or more letters might necessitate deleting a record from the related table if that results in a count of zero for that record's base pair--something not strictly necessary for your calculation, but would help keep the set of related records uncluttered.

                                  LaRetta's question is a good one as a table of values for each pair could be very useful for setting up this calculation.

                                  • 14. Re: Help! Stuck with patternsearch

                                    Why would one want to know where the values are stored? It doesn't matter to the problem. But if you MUST know the calculation, then it looks something like

                                    epsilon =

                                         (PatternCount( Sequence cleaned up; "AA")  *  13700) 

                                     +  (PatternCount( Sequence cleaned up; "AC")  *  10600)

                                     +  (PatternCount( Sequence cleaned up; "AG")  *  12500) 

                                     +  (PatternCount( Sequence cleaned up; "AT")  *  11400)

                                    etc etc

                                    The problem is that PatternCount doesn't deal with runs like TTT or AAAAAAAAGGGCCCCCTTT in a way that would give the right answer for this particular problem.
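To show how much the overlap matters for the sum, here is a hedged Python sketch (not FileMaker) using only the four coefficients quoted in the calculation above. The other twelve are elided there ("etc etc") and are deliberately left out, so the numbers below are partial sums, not real extinction coefficients. The names `QUOTED_COEFFICIENTS`, `partial_epsilon` and `partial_epsilon_nonoverlapping` are hypothetical.

```python
# Only the coefficients quoted in the post; the remaining doublets are not
# given there, so they contribute 0 in this sketch.
QUOTED_COEFFICIENTS = {"AA": 13700, "AC": 10600, "AG": 12500, "AT": 11400}


def partial_epsilon(sequence: str, coefficients=QUOTED_COEFFICIENTS) -> int:
    """Sum the coefficient of every neighbouring pair, overlaps included."""
    total = 0
    for i in range(len(sequence) - 1):
        total += coefficients.get(sequence[i:i + 2], 0)
    return total


def partial_epsilon_nonoverlapping(sequence: str,
                                   coefficients=QUOTED_COEFFICIENTS) -> int:
    """Same sum, but with a count that skips overlapping matches."""
    return sum(sequence.count(pair) * value
               for pair, value in coefficients.items())


seq = "AAAAAAAAGGGCCCCCTTT"
print(partial_epsilon(seq))                 # 108400 (7 x AA + 1 x AG)
print(partial_epsilon_nonoverlapping(seq))  # 67300  (4 x AA + 1 x AG)
```

The run of eight A's contains seven AA neighbour pairs, but a non-overlapping count finds only four, which accounts for the whole gap between the two results.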
