01-01-2010 : Varselec

1) The CD data for analysis should be a single column of delta epsilon (NOT molar ellipticity) values, starting with the shortest wavelength, with a title above the column. In calculating delta epsilon values, the concentration of amide bonds, NOT of the protein, should be used. All your data are transformed into delta epsilon values and saved as a b.dta file that should look like the following example:

    Title
    0.1
    0.2
    0.3
    .
    .
    .
    0.0
    0.0

The first number after the title is the delta epsilon value at 178 nm and the zero at the end of the file is the delta epsilon value at 260 nm.

2) The COMCTL.DTA file is automatically generated and should look like the following example:

    33 31 1 528
    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
    26 27 28 29 30 31 32 33

(The first line is in 2I5, 2I7 format and the second and third lines are in I3 format, with only 25 proteins in the second line.)

In the first line, 33 is the starting number of proteins in the basis set (N), 31 is the final number of proteins in the basis set (n), 1 is the starting number of combinations (s), and 528 is the final number of combinations. The number of combinations is calculated as:

    N! / (n! (N-n)!) = 33! / (31! (33-31)!) = 528

The second and third lines in the COMCTL.DTA file show the proteins in the basis set for the current analysis (1 through 33).

3) If your CD data have been recorded from 260 to 178 nm at a 1.0 nm interval (83 data points), you are ready to run the program.

4) If your CD data have not been recorded from 260 to 178 nm, the N.DTA file, which specifies the starting and ending wavelengths and the number of data points in your CD file, is automatically generated. The S.DTA file (CD of the 33 basis proteins) is automatically adjusted so as to truncate the basis protein CD spectra to the same number of data points as in your CD data. The original file provided with the Johnson program has been renamed S.ORI.
However, to analyze for 5 structures (alpha-helix, anti-parallel beta-sheet, parallel beta-sheet, beta-turn, and others), your CD data should contain at least 5 significant pieces of information, e.g., from 260 to 184 nm (Manavalan and Johnson 1985, J. Biosci. Suppl. 8, 141-149). Please do not use this program if your CD spectra have not been recorded down to at least 184 nm.

After a few seconds, the number 1 will appear on your computer screen, followed by 2, 3, ... until the final number of combinations (step #2) is reached. However, for the reasons explained in step #9, we suggest you begin by running the program with all the proteins in the basis set to create only 1 combination, by editing the COMCTL.DTA file as follows:

    33 33 1 1
    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
    26 27 28 29 30 31 32 33

7) The output file of the varslc1 program is called COMB2.OP. The second program, called SEARCH, is used to choose the combinations in COMB2.OP that meet the criteria. It has not been included in the DICROPROT package. The criteria are read from a file called SRCTL.DTA.

8) The criteria for choosing acceptable combinations are the following:

A) The total of all five secondary structures should be between 0.96 and 1.05.

B) Negative secondary structure fractions may not be less than -0.05, and these negative numbers are assumed to be zero.

C) The fraction of alpha-helix (H) should not change significantly from the original analysis with all the basis proteins.

D) The calculated CD spectrum should be a reasonable fit to the measured CD spectrum (a root-mean-square residual of about 0.2 delta epsilon units), but this is a secondary criterion, and the residual may be larger for proteins with intense CD spectra.

E) Combinations that meet the first four criteria and eliminate the fewest proteins from the basis set are averaged to give the reported analysis.

9) Edit the SRCTL.DTA file to specify the criteria for the desired combinations.
The SRCTL.DTA file looks like the following example:

    0.96 1.05 0.22 -0.05 83

(0.96 = lower limit for the total, 1.05 = upper limit for the total, 0.22 = RMSE, -0.05 = largest negative value allowed, 83 = number of data points.)

The criteria for acceptable combinations are somewhat different from what we reported earlier and represent an improvement. We had suggested root-mean-square (RMSE) values equal to or less than 0.22 delta epsilon for acceptable combinations when using the variable selection method. However, in our experience with this method we have observed repeatedly that restricting the RMSE value to less than 0.22 sometimes leads to wrong predictions. The restriction of RMSE values was a concern of another group as well (van Stokkum et al. 1990, Anal. Biochem. 191, 110-118). We suggest that priority should be given to the total of the secondary structures. Therefore, we modify our criteria for acceptable combinations as follows:

First, analyze the CD with all proteins included in the basis set to create only one combination, to see if there are any negative values in the secondary structure prediction. If there are no negative values, then we exclude all combinations with negative values from the acceptable combinations in subsequent variable selection; otherwise, values no more negative than -0.05 are allowed in the end. Also, as mentioned earlier, the fraction of alpha-helix (H) predicted from this 1 combination should not change significantly in subsequent combinations from the variable selection analysis.

Second, run variable selection, removing various combinations of proteins from the basis set (for example, remove 3 proteins in step #2 to create 5456 combinations), and choose the combinations with "TOTAL" secondary structures between 0.96 and 1.05, regardless of their RMSE values (e.g., choose RMSE = 0.4-0.6 to start). Then run the SEARCH program to choose all the combinations in COMB2.OP that meet the criteria. The output of the SEARCH program is called SEARCH.OP.
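As an illustration only, the acceptance test described above can be sketched in Python. The real SEARCH program and the layout of COMB2.OP are not shown in this document, so the names SearchCriteria, parse_srctl and meets_criteria are hypothetical, and passing the five fractions and the RMSE as plain numbers is an assumption of this sketch. It also assumes that the total is the plain sum of the five fractions, taken before any negative value is set to zero.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SearchCriteria:
    """The five values shown in the SRCTL.DTA example (hypothetical container)."""
    total_min: float      # lower limit for the total of the five fractions
    total_max: float      # upper limit for the total of the five fractions
    rmse_max: float       # largest acceptable RMSE, in delta epsilon units
    most_negative: float  # most negative single fraction allowed
    n_points: int         # number of data points in the CD spectrum


def parse_srctl(line: str) -> SearchCriteria:
    """Parse one whitespace-separated line such as '0.96 1.05 0.22 -0.05 83'."""
    lo, hi, rmse, neg, n = line.split()
    return SearchCriteria(float(lo), float(hi), float(rmse), float(neg), int(n))


def meets_criteria(fractions: Sequence[float], rmse: float,
                   criteria: SearchCriteria) -> bool:
    """Return True if one combination passes the total, negativity and RMSE tests."""
    total = sum(fractions)  # assumption: raw sum, negatives not yet zeroed
    if not (criteria.total_min <= total <= criteria.total_max):
        return False
    if min(fractions) < criteria.most_negative:
        return False
    return rmse <= criteria.rmse_max


if __name__ == "__main__":
    criteria = parse_srctl("0.96 1.05 0.22 -0.05 83")
    # Made-up fractions (helix, anti-parallel, parallel, turn, other) for illustration.
    print(meets_criteria([0.35, 0.20, 0.10, 0.15, 0.21], 0.15, criteria))
```

Loosening the RMSE limit first (for example 0.6 instead of 0.22) and tightening it afterwards corresponds to the order of operations recommended in the text.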
If there are many such combinations in the SEARCH.OP file, narrow the range for the total and choose the ones closest to 1.00, so that you have about twenty combinations in SEARCH.OP. Then begin narrowing the RMSE value in SRCTL.DTA and rerun the SEARCH program until you have about ten combinations that have a total close to 1.00 and the smallest RMSE values.

If no combinations satisfy the criteria, you will have to edit the COMCTL.DTA file (step #2) to eliminate more proteins from the basis set. You could remove 4 at a time to create 40920 combinations, but completing 40920 combinations will take days, depending on the computer. In this case, start the analysis with fewer proteins in the basis set by removing the 1 protein whose removal consistently gave the best total in the previous run. You will then have 32 proteins in the basis set instead of 33.

How do you decide which protein to remove from the basis set? Run the SEARCH program (on the 5456 combinations, for example) to select about 20-30 combinations that are closest to the criteria but do not necessarily meet them. Then choose the protein that is missing from the largest number of these combinations. For example, let us say protein number 29 is missing from most of the combinations chosen by the SEARCH program. Then just delete the number 29 from the list of proteins in the COMCTL.DTA file. Now you have only 32 proteins to start with. If you remove 3 proteins at a time, you will have 4960 combinations according to the above equation. The COMCTL.DTA file should now look like the following:

    32 29 1 4960
    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
    26 27 28 30 31 32 33

In our example there were 14 combinations with a "total" of 0.97-1.04 and an "rmse" value of 0.13 delta epsilon.
In your case, if you still do not have any acceptable combinations, repeat these steps until you get combinations that meet the criteria.
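The bookkeeping for shrinking the basis set can be sketched as follows. This is a minimal illustration, not part of the Johnson programs: the function names most_often_missing and comctl_header are hypothetical, and each selected combination is assumed to be available as the list of protein numbers it retains, which is an assumption about how SEARCH.OP would be read.

```python
from collections import Counter
from math import comb
from typing import Iterable, Sequence, Tuple


def most_often_missing(basis: Sequence[int],
                       combinations: Iterable[Sequence[int]]) -> int:
    """Return the basis protein that is absent from the most combinations."""
    missing: Counter = Counter()
    for kept in combinations:
        kept_set = set(kept)
        missing.update(p for p in basis if p not in kept_set)
    if not missing:
        raise ValueError("no protein is missing from any combination")
    protein, _count = missing.most_common(1)[0]
    return protein


def comctl_header(basis: Sequence[int], removed_per_run: int) -> Tuple[int, int, int, int]:
    """Return (N, n, first combination, last combination) for COMCTL.DTA."""
    n_start = len(basis)
    n_final = n_start - removed_per_run
    # Number of combinations: N! / (n! (N-n)!)
    return n_start, n_final, 1, comb(n_start, n_final)


if __name__ == "__main__":
    basis = list(range(1, 34))
    # Made-up selection: protein 29 is absent from two of three combinations.
    selected = [
        [p for p in basis if p not in (5, 7, 29)],
        [p for p in basis if p not in (8, 9, 29)],
        [p for p in basis if p not in (1, 2, 3)],
    ]
    drop = most_often_missing(basis, selected)
    reduced = [p for p in basis if p != drop]
    print(drop, comctl_header(reduced, 3))
```

With the made-up selection above, protein 29 is dropped and the header for removing 3 of the remaining 32 proteins matches the 32 29 1 4960 line given in the text.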