OpenR: R & Statistics
A little question about CCA
24 March 2022, 5:56 PM
To understand CCA better, I've spent some time on the mathematical background, and that has already cleared up most of my confusion. But I still have one question that doesn't seem to be raised or answered in my classmates' materials (maybe I just didn't notice it; if so, feel free to point it out, thanks!):
In short, we essentially want to have an equation set like:
ZX1 = -0.03824761 * x_bill_length_mm + 0.03157476 * x_bill_depth_mm
ZY1 = -0.05619966 * y_flipper_length_mm + 0.00151493 * y_body_mass_g
which represent the linear combinations within the X set and the Y set separately. But I think we ought to end up with something of the form "Y = kX + b", whereas the two equations above are still separate, in the form "X = ..., Y = ...". So far I can't see how they could be transformed into the usual format "Y = kX + b", even though ρ was successfully calculated using a least-squares method.
How can I combine both ZX1 & ZY1 into one equation and get the "k" & "b" in the final integrated regression? Thank you!
24 March 2022, 6:50 PM
Hello, I'm not sure I understand you correctly, but CCA looks for a linear combination of the X variables and a linear combination of the Y variables whose correlation is as large as possible. In other words, the x and y coefficients are chosen to maximise that correlation, rather than by fitting a single function that expresses one matrix in terms of the other.
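A minimal sketch of that "maximum correlation" idea, in case it helps. It assumes the built-in iris data as a stand-in for the penguins data (an assumption, not the course dataset) and uses base R's cancor(). Randomly chosen coefficient vectors should never give a larger absolute correlation than the first canonical correlation.

```r
# Stand-in data (assumption): iris instead of the penguins data from the thread
X <- as.matrix(iris[, c("Sepal.Length", "Sepal.Width")])
Y <- as.matrix(iris[, c("Petal.Length", "Petal.Width")])

# First canonical correlation from base R's cancor()
rho1 <- cancor(X, Y)$cor[1]

# Try many random linear combinations of the X set and of the Y set
set.seed(1)
rand_cor <- replicate(1000, {
  a <- rnorm(ncol(X))   # random coefficients for the X variables
  b <- rnorm(ncol(Y))   # random coefficients for the Y variables
  abs(as.numeric(cor(X %*% a, Y %*% b)))
})

print(rho1)
print(max(rand_cor))    # stays at or below rho1
```

The random search is only an illustration; cancor() finds the maximising coefficients directly.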
24 March 2022, 7:37 PM
Thank you very much, Lingxiao! I mostly agree with your ideas. Beyond that, I would like to figure out more about the relationship between the X set and the Y set. If my understanding is wrong anywhere, please let me know.
24 March 2022, 7:02 PM
Given the formulas for the linear combinations, we don't need to find b. This is a matrix calculation problem: what we need is the coefficient on each x so that X = a1x1 + a2x2 + … + aixi. We should focus on interpreting the results of CCA. For example, in exercise 2, we get xcoef and ycoef for the first column of canonical variates. We find that X1 and CBS are most relevant, because c has the largest coefficient (5.06). In the penguins dataset, ZX1 = -0.03824761 * x_bill_length_mm + 0.03157476 * x_bill_depth_mm means canonical variate X1 depends on both bill length and bill depth, because the absolute values of their coefficients are close. ZY1 = -0.05619966 * y_flipper_length_mm + 0.00151493 * y_body_mass_g means canonical variate Y1 depends mainly on flipper length, because its coefficient is much larger in absolute value than that of body mass.
24 March 2022, 7:55 PM
Thank you for your detailed explanation! Actually, I think it's not quite what I was asking. I still want to ask why k and b are not important here, and what the nature of these two parameters is if they do exist. But your explanation really helps me understand the concept better, and I agree with most of your ideas, with one partial exception.
1. We find that X1 and CBS are most relevant, because c has the largest coefficient (5.06). In penguins’ dataset, ZX1 =
I think this is about contribution rather than relevance, because each variable is a sub-level contributor to the canonical variate.
The last question is: what is the relationship between ZX1 & ZY1? All in all, I think one of the ultimate purposes is to bridge the two "integrated concept sets". And although we need to find the maximum ρ between the two, a k and a b ought to naturally be there.
If there is anything wrong with my statement, please let me know. Thank you very much again.
24 March 2022, 8:35 PM
For example, in the CCA-TileStats video, the aim is to use SBP and DBP to represent BP, and weight and height to represent BS. Clearly, we don't need y = ax + b to figure out the relationship, so you don't have to worry about how to get b. We never assumed y = ax + b.
If you want to find the relationship between ZX1 & ZY1, remember that CCA is a dimension-reduction technique: you calculate Σ_XX^{-1} Σ_XY Σ_YY^{-1} Σ_YX, use eigen(matrix)$values to get its eigenvalues, and then take sqrt() of them. That gives you the canonical correlations.
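The eigenvalue route can be sketched roughly as follows. This assumes the built-in iris data as a stand-in for the penguins data (an assumption), and the variable names Sxx, Sxy, Syy, Syx and M are just illustrative. The result is compared with base R's cancor() as a sanity check.

```r
# Stand-in data (assumption): iris instead of the penguins data from the thread
X <- as.matrix(iris[, c("Sepal.Length", "Sepal.Width")])
Y <- as.matrix(iris[, c("Petal.Length", "Petal.Width")])

# Covariance blocks of the two variable sets
Sxx <- cov(X)
Syy <- cov(Y)
Sxy <- cov(X, Y)
Syx <- t(Sxy)

# Eigenvalues of this matrix are the squared canonical correlations
M <- solve(Sxx) %*% Sxy %*% solve(Syy) %*% Syx
rho_eigen <- sqrt(Re(eigen(M)$values))

# Reference values from cancor()
rho_cancor <- cancor(X, Y)$cor

print(rho_eigen)
print(rho_cancor)
```

Re() is only a guard in case eigen() returns complex numbers with zero imaginary part for this non-symmetric matrix.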
24 March 2022, 8:48 PM
In addition, in CCA what we care about is the correlation within each pair of canonical variables. The sub-level contributors you mentioned determine each canonical variable. Different sub-level contributors produce different canonical variables, which directly affects the correlation between the pair of canonical variables, so they are important.
24 March 2022, 8:58 PM
I think I understand what you are saying. If you were to calculate k and b exactly, it would be an ordinary regression procedure, which would involve a large number of calculations for a multivariate data set. The point of using CCA is to avoid calculating the exact values of k and b and thus simplify the regression.
24 March 2022, 8:54 PM
Hello, do you mean that you want to perform linear regression using the x and y datasets? I think this is not the purpose of CCA. CCA aims to find the relationship between two high-dimensional datasets, rather than using one set to predict the other. As for ZX1 and ZY1, you may find them unintuitive because x and y do not appear in the same formula. For this, you may want to visit https://online.stat.psu.edu/stat505/book/export/html/682, where the detailed mathematical derivation is shown; it may help you understand the canonical variate pair.
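For what it's worth, if one does regress ZY1 on ZX1 anyway, the "k" and "b" from the original question take a very simple form. Here is a small sketch, assuming the built-in iris data as a stand-in for the penguins data and the scaling that base R's cancor() applies (centred variates with equal variance). Under those assumptions the slope equals the first canonical correlation and the intercept is zero, so the regression adds nothing beyond ρ.

```r
# Stand-in data (assumption): iris instead of the penguins data from the thread
X <- as.matrix(iris[, c("Sepal.Length", "Sepal.Width")])
Y <- as.matrix(iris[, c("Petal.Length", "Petal.Width")])
cc <- cancor(X, Y)

# First pair of canonical variates, built from the centred data
Xc <- sweep(X, 2, cc$xcenter)            # subtract column means
Yc <- sweep(Y, 2, cc$ycenter)
ZX1 <- as.vector(Xc %*% cc$xcoef[, 1])
ZY1 <- as.vector(Yc %*% cc$ycoef[, 1])

# Ordinary regression of ZY1 on ZX1
fit <- lm(ZY1 ~ ZX1)
b <- unname(coef(fit)[1])                # intercept
k <- unname(coef(fit)[2])                # slope
rho1 <- cc$cor[1]                        # first canonical correlation

print(c(k = k, b = b, rho1 = rho1))
```

With a different scaling of the variates, the slope would instead be ρ times the ratio of their standard deviations, as in any simple linear regression.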
25 March 2022, 1:45 PM
Nice discussion thread! I would like to give all of you a thumbs-up!