If identical 2-character references should be counted as two then I see it as:
ABCDEFTTT = 8
Is this right? If so then Length ( text ) - 1 will work, I think.
Thanks, but I need the number of each doublet, not the number of doublets, so 1x AA, 3x CC, 1x GC etc
Alternatively, if I could turn ACGATTGC into AC CG GA AT TT TG GC I could use the normal PatternCount, but for this I need a kind of for...next loop, and I haven't figured out how to do that.
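To make the transformation described above concrete, here is a minimal sketch in Python rather than FileMaker (the function name `expand_doublets` is made up for this illustration). It slides a two-character window along the sequence so that every overlapping pair becomes its own space-separated token:

```python
def expand_doublets(sequence):
    """Return every overlapping 2-character pair, separated by spaces."""
    # A sequence of length n has n - 1 neighbouring pairs.
    pairs = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    return " ".join(pairs)


print(expand_doublets("ACGATTGC"))  # AC CG GA AT TT TG GC
```

Because the tokens are separated by spaces, a plain substring count on the result can no longer match across two neighbouring pairs.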
Are you needing the doublet count for a specified pair or a report that provides a break down of all existing pairs?
The count for all doublets in a given string (which can vary in length, usually 18 to 30, and in sequence).
Not sure how to help here so please ignore my input until you have reached an answer...but I can't stand it, I have to ask, just too curious.
What the heck would you use a calculation like that for? LOL
This is something you may want to tackle with a script that loops through your text to produce a breakdown of each possible pairing for a given strand. It could also be done with a recursive custom function or a recursive calculation field, but the script may be a more convenient option for presenting the results.
Knowing just a bit more about what you need to do with the results of such an analysis might easily tilt us toward one or the other of these options here.
We have a database of oligos (short pieces of DNA)
Certain properties (like length, molecular weight, melting temperature, and extinction coefficient) need to be calculated automatically from the sequence.
The best method is based on nearest-neighbour calculations and requires knowing how many doublets of each type there are, i.e. (in the simplified version) how many doublets of
AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT
Note that something like TTT are TWO TT doublets (neighbour pairs)
Each doublet has a value associated with it (e.g. AA is 15400, AC is 12400 etc) and the sum over all neighboring pairs is required for the calculation - that's the easy part then, getting the number of neighbor pairs (for each doublet) seems to be the hard part.
P.S. This is being calculated on the fly as one enters the sequence as well as when a record opens as we have over 4000 records already in the database - scientists are notoriously lazy in entering values in a database that are not specifically useful to themselves.
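As an illustration of the summation step described in the post above, here is a rough Python sketch. The names `EXAMPLE_VALUES` and `neighbour_sum` are invented for this example, and the table holds only the two values quoted in the post; the other 14 doublets are deliberately left out rather than guessed. It covers only the sum over neighbour pairs, not the rest of the calculation.

```python
from collections import Counter

# Only the two values quoted in the post; the other 14 doublets are not filled in.
EXAMPLE_VALUES = {"AA": 15400, "AC": 12400}


def neighbour_sum(sequence, values):
    """Sum the per-doublet values over all overlapping neighbour pairs."""
    counts = Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))
    # A KeyError here means the table has no value for a doublet in the sequence.
    return sum(values[pair] * n for pair, n in counts.items())


# AAAC contains AA twice (overlapping) and AC once.
print(neighbour_sum("AAAC", EXAMPLE_VALUES))  # 2 * 15400 + 12400 = 43200
```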
Ok, but what form do the results take here? In many ways, it seems you need something similar to a summary report that lists all possible pairs with total counts for each. I realize that report is not something one would "generate on the fly while entering the sequence", but I'm trying to visualize what you want to see for your results here.
If LaRetta or anyone can suggest a better way, I'm all ears, but I'm leaning towards a scripted method that would use the OnObjectKeystroke trigger to process each addition and deletion of a letter and generate (or delete) records in a related table where you have one record for each possible pairing. A calculation (possibly a recursive custom function or recursive calculation field) could then be evaluated from each related record to report a pair count for that specific pair record.
"Each doublet has a value associated with it (e.g. AA is 15400, AC is 12400 etc)"
Where are these values stored?
No. You are thinking too "database". This is a calculation triggered by the keystroke in the sequence field. It is not about making a report or deleting records - it just has to calculate a value and put it into the field "molar extension coefficient" based on the string of letters entered in the field "sequence". And for that I need to figure out how many of each doublet there are in the sequence.
If you read over what I wrote then you will see that it is explained clearly and with examples. I even mention two possible ways to deal with it that I can see:
(a) count the doublets in some form of loop
(b) turn the sequence into a form that PatternCount can deal with (like turning ACGATTGC into AC CG GA AT TT TG GC as mentioned above), again needing a loop
This is something that is VERY easy to do in any programming language that I know (Basic, C, Python, Pascal) by using a for next loop and a variable - FileMaker seems to struggle badly with something that should be simple.
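For comparison, option (a) might look roughly like this in Python, one of the languages mentioned above. This is only a sketch; `BASES` and `count_doublets` are names chosen for the illustration, and skipping characters other than A, C, G and T is an assumption about how stray input should be handled.

```python
BASES = "ACGT"


def count_doublets(sequence):
    """Count each of the 16 possible doublets, overlaps included."""
    # Start every doublet at zero so absent pairs still appear in the result.
    counts = {a + b: 0 for a in BASES for b in BASES}
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # skip pairs containing anything other than A, C, G, T
            counts[pair] += 1
    return counts


counts = count_doublets("ACGATTGC")
print(counts["TT"], counts["AC"], counts["GG"])  # 1 1 0
```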
As I said, I was using the "report" as an analogy to understand the format you needed for your result(s).
I'd still use a table of related records as a way to better support the calculation that you want. It's not the only way to do this by far, but it can be one way to keep from counting the same base pair more than once.
The "Delete" records reference was an attempt to keep in mind that editing the sequence to remove one or more letters might necessitate deleting a record from the related table if that results in a count of zero for that record's base pair--something not strictly necessary for your calculation, but which would help keep the set of related records uncluttered.
LaRetta's question is a good one as a table of values for each pair could be very useful for setting up this calculation.
Why would one want to know where the values are stored? It doesn't matter to the problem. But if you MUST know the calculation, then it looks something like
(PatternCount( Sequence cleaned up; "AA") * 13700)
+ (PatternCount( Sequence cleaned up; "AC") * 10600)
+ (PatternCount( Sequence cleaned up; "AG") * 12500)
+ (PatternCount( Sequence cleaned up; "AT") * 11400)
The problem is that PatternCount doesn't deal with multiples like TTT or AAAAAAAAGGGCCCCCTTT in a way that would give the right answer for this particular problem.
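The mismatch can be shown with a small Python analogy. Python's `str.count` counts non-overlapping matches, which is assumed here to mirror the PatternCount behaviour described above; `overlapping_count` is a hypothetical helper that counts every neighbouring pair instead.

```python
def overlapping_count(sequence, pair):
    """Count occurrences of a 2-character pair, allowing overlaps."""
    return sum(1 for i in range(len(sequence) - 1) if sequence[i:i + 2] == pair)


# str.count skips past each match, so a run of identical letters is undercounted.
print("TTT".count("TT"), overlapping_count("TTT", "TT"))    # 1 2
print("AAAA".count("AA"), overlapping_count("AAAA", "AA"))  # 2 3
```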