Mathematics desk
< December 2 | December 4 >
Welcome to the Wikipedia Mathematics Reference Desk Archives
The page you are currently viewing is an archive page. While you can leave answers for any questions shown below, please ask new questions on one of the current reference desk pages.
Kind of an esoteric question; I'm modeling something rare, about 1,000 cases in a population of about a million (SAS, proc genmod, Poisson distribution). I intend to do the more or less routine thing: build the model on one third of the sample, run that model on the second third to check for overfitting, and use the final third to validate/score. This works fine, but I sampled the cases and the controls separately and then combined them (i.e. 1/3 of the cases + 1/3 of the controls), rather than taking a 1/3 sample of the combined cases and controls, because I was worried that a 1/3 sample of the whole thing might contain too few cases. Practicality aside, I'd like to hear any arguments regarding the theoretical validity (or lack of it) of sampling controls and cases separately and then combining them. Thanks. Gzuckier ( talk) 04:46, 3 December 2012 (UTC)
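For context, the split described above amounts to a three-fold split stratified by case status. A minimal sketch of that procedure, written here in MATLAB purely for illustration (the original work is in SAS, and the variable names isCase and data are made up):

```matlab
% Stratified three-fold split: cases and controls are each divided into thirds
% separately, then the corresponding thirds are combined.
% Assumptions: isCase is a logical vector (one entry per record), and data is a
% matrix or table whose rows line up with isCase.
caseIdx = find(isCase);                         % indices of the ~1,000 cases
ctrlIdx = find(~isCase);                        % indices of the ~1 million controls

caseIdx = caseIdx(randperm(numel(caseIdx)));    % shuffle within each stratum
ctrlIdx = ctrlIdx(randperm(numel(ctrlIdx)));

% assign each record a fold number 1..3, separately within cases and controls
foldOfCase = mod((0:numel(caseIdx)-1)', 3) + 1;
foldOfCtrl = mod((0:numel(ctrlIdx)-1)', 3) + 1;

fold = zeros(size(isCase));
fold(caseIdx) = foldOfCase;
fold(ctrlIdx) = foldOfCtrl;

train    = data(fold == 1, :);   % build the model here
check    = data(fold == 2, :);   % check for overfitting
validate = data(fold == 3, :);   % final scoring/validation
```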
The following is a 2D heat-map histogram of X,Y values, with a linear color scale. Magenta is the bisquare robust fit; cyan is the least-squares fit. The robust fit improves only slightly on the least-squares fit. How do I make it follow the trend? The plot is a log-log plot of log velocity (on the Y axis) against log curvature (on the X axis), where the data apparently follow a power law. However, linear regression seems to give the wrong slope. John Riemann Soong ( talk) 05:18, 3 December 2012 (UTC)
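For reference, a minimal sketch of the two fits being compared, assuming the raw velocity and curvature values are in vectors v and R (the names are illustrative, not from the original code):

```matlab
% Fit a line in log-log space by ordinary least squares and by bisquare robust
% regression, then recover the power-law parameters y = K * x^b.
x = log10(R(:));                       % log curvature (predictor)
y = log10(v(:));                       % log velocity  (response)

bLS  = polyfit(x, y, 1);               % least squares: bLS(1) = slope, bLS(2) = intercept
bRob = robustfit(x, y, 'bisquare');    % robust fit: bRob(1) = intercept, bRob(2) = slope

betaLS  = bLS(1);    K_LS  = 10^bLS(2);
betaRob = bRob(2);   K_Rob = 10^bRob(1);
```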
I've also read on the internet the maximum likelihood estimator might be better than linear regression for power laws. However, this seems to be based on X-frequency rather than X-Y (velocity-curvature). Is there a way to use maximum likelihood estimators between two variables? Why are the discussions biased to frequency? John Riemann Soong ( talk) 05:36, 3 December 2012 (UTC)
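One way to read this: the MLE methods usually discussed online fit the frequency distribution of a single variable, p(x) ∝ x^(−α), which is a different problem from regressing y on x. For the two-variable case, if one assumes independent Gaussian errors on the log scale (an assumed error model, not something established by the data above), maximizing the likelihood is equivalent to ordinary least squares on the log-log data, so the MLE and the regression line coincide under that model:

```latex
% Assumed error model: i.i.d. Gaussian noise on the log scale.
\log y_i = \beta \log x_i + \log K + \varepsilon_i,
  \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)

% Log-likelihood: maximizing over (\beta, K) minimizes the sum of squared
% residuals, i.e. it reproduces the ordinary least-squares fit on log-log axes.
\ell(\beta, K, \sigma) = -\frac{n}{2}\log\!\left(2\pi\sigma^2\right)
  - \frac{1}{2\sigma^2} \sum_{i=1}^{n}
    \left(\log y_i - \beta \log x_i - \log K\right)^2
```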
I really can't draw the slopes myself. I have to measure the slopes for each genotype and measure the differences between them. The background to this is that I have to distinguish between genotypes, which I have found to have statistically significantly different slopes. The separation is good, or at least decent. I was trying to achieve an even better separation, and to see whether the effect size was even greater, by having the regression line actually fit the data. The model is log y = b log x + log K (to fit the power law y = K*x^b). I modified the robust regression parameters to use a really low tuning constant (1.2) for the bisquare weighting algorithm. Here are some new histograms. (Not really important, but I just realized the x-axis labels are wrong: R should be in mm, not rad/mm.)
By lowering the tuning constant, the slopes (and intercepts) found are slightly better, but they are still not an optimal fit. The problem is that there seems to be a point where lowering the tuning constant further makes no difference, because robustfit() hits the iteration limit once the tuning constant drops below a certain value. If I raise the iteration limit from 50 to 150, I still get roughly the same slopes. That's fine in the sense that it still returns valid slopes, but there appears to be a limit to how much the slopes can be corrected. John Riemann Soong ( talk) 01:27, 4 December 2012 (UTC)
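For anyone reading along, the call described above looks roughly like this (MATLAB's default tuning constant for 'bisquare' is 4.685, so 1.2 is very aggressive); the second output also gives standard errors that are useful later for comparing slopes:

```matlab
% Bisquare robust fit with an explicit (low) tuning constant.
% x and y are the log-curvature / log-velocity vectors from above.
[bRob, stats] = robustfit(x, y, 'bisquare', 1.2);
intercept = bRob(1);
slope     = bRob(2);
slopeSE   = stats.se(2);   % standard error of the slope
```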
The logarithmic histogram gives a clearer picture of where the errors might be coming from. I noticed initially that the line seemed to fit the logarithmic histogram (where the frequency z is scaled logarithmically), whereas the linear scale hides the low-frequency (but seemingly high-influence) values. Fumin has an interesting subdistribution, which is why I saved the image for my lab presentation this morning; the other genotypes don't. They all have a "long tail" (frequency-wise) in the (x,y) coordinates "below" the line, though, and the tail's falloff runs almost "normal", i.e. orthogonal, to the main trend line.
I think this is why robust regression is having a hard time even with ridiculously low tuning constants: because of the slow dropoff of the long tail (away from the "median" or "mode"), those points are not treated as outliers, so their r (the error/distance with respect to the previous iteration) is still seen as low. How do I obtain a "median" line fit in MATLAB?
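On the last question: robustfit has no built-in "median line" option, but a least-absolute-deviations (L1) fit, which is median regression in the quantile-regression sense, can be sketched with fminsearch. This is only an illustration of one way to do it, seeded from the ordinary least-squares fit:

```matlab
% Least-absolute-deviations (L1 / "median") line fit via fminsearch.
b0  = polyfit(x, y, 1);                        % starting values: [slope intercept]
lad = @(b) sum(abs(y - (b(1)*x + b(2))));      % L1 objective
bL1 = fminsearch(lad, b0);                     % bL1(1) = slope, bL1(2) = intercept
```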
Specific to fumin: I note that the straight robust line in the "fumin" histogram goes through the main distribution, which is good, but it is still somewhat influenced by the smaller subdistribution (the local frequency maximum in that region is unique to fumin), and this pulls its slope down below the other genotypes'. Individually, each of the fumin flies' regressions has a higher slope than the other genotypes, with at least 95% confidence, if not more. John Riemann Soong ( talk) 01:35, 4 December 2012 (UTC)
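For the slope comparisons, one rough way (my own illustration, not the original analysis) to compare two genotypes' robust slopes is an approximate z-test built from the standard errors that robustfit returns:

```matlab
% Approximate two-sample comparison of robust-fit slopes for two genotypes.
% (x1,y1) and (x2,y2) are the log-log data for the two genotypes being compared.
[b1, s1] = robustfit(x1, y1, 'bisquare', 1.2);
[b2, s2] = robustfit(x2, y2, 'bisquare', 1.2);

z = (b1(2) - b2(2)) / sqrt(s1.se(2)^2 + s2.se(2)^2);   % approximate z statistic
p = 2 * (1 - normcdf(abs(z)));                          % two-sided p-value
```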