
How do I best weigh the commonality between sets, weighted to the size of the sets?

I have about 350 online petitions, each of which has between 250 and 25,000 signatures. For any two petitions, I can easily measure how many individual signatories have signed both of them. I want to analyze the commonality between two petitions based on the number of common signatures, but I don't know the best way to weight for total signatures. The most obvious way is:

```python
# a and b are sets of signature UIDs for two petitions
len(a & b) / (len(a) + len(b))
```

But this does not seem to work well for comparing petitions when one has a small number of signatures and the other has a large number. Is there a better way to weight the denominator? Maybe the sum of the logs of the lengths? I don't care about the absolute value of the measurement, just that it's relative to all the others.
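(For concreteness, a minimal sketch of the size-mismatch problem described above, using synthetic sets; the 250 and 25,000 figures are just the extremes quoted in the question, not real data:)

```python
# Illustrative only: why len(a & b) / (len(a) + len(b)) penalizes size mismatch.
# The sets below are synthetic stand-ins for sets of signature UIDs.
small = set(range(250))       # a petition with 250 signatures
large = set(range(25_000))    # a petition with 25,000 signatures; contains all of `small`
twin = set(range(250))        # an exact copy of `small`

def overlap(a, b):
    """The 'obvious' measure from the question: shared signatures over total size."""
    return len(a & b) / (len(a) + len(b))

print(overlap(small, twin))   # 0.5     -- two identical small petitions score the maximum
print(overlap(small, large))  # ~0.0099 -- even though every signer of `small` also signed `large`
```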

If the total population is $n$, then we'd expect the intersection to have size $\frac{|a|\cdot|b|}{n}$ for independent (uncorrelated) sets. Thus an intersection much bigger than this would indicate a high positive correlation, while a much smaller intersection would indicate negative correlation (as, in the extreme, for a "pro" and a "contra" petition). If we don't know $n$ (should we take the worldwide population? The national population? Or simply all those who have signed at least one petition?), we can still use $$\tag1 \frac{|a\cap b|}{|a|\cdot|b|} $$ as a relative measure. (In fact, if you have some $a$, $b$ that you have reason to believe are uncorrelated, the reciprocal of $(1)$ gives you an interesting way to estimate $n$.)
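A minimal sketch of the relative measure $(1)$, and of the $n$ estimate mentioned at the end. The input format (a dict mapping petition names to sets of signature UIDs) and the example data are assumptions for illustration, not part of the question:

```python
from itertools import combinations

# Assumed input format: petition name -> set of signature UIDs (hypothetical data).
petitions = {
    "save_the_park": {"u1", "u2", "u3", "u4"},
    "ban_the_park":  {"u5", "u6", "u7"},
    "more_benches":  {"u1", "u2", "u8"},
}

def similarity(a, b):
    """Relative correlation measure (1): |a ∩ b| / (|a| * |b|).

    Values are only meaningful relative to each other; a larger value means
    the two petitions overlap more than independent sets of those sizes would.
    """
    return len(a & b) / (len(a) * len(b))

# Rank all pairs of petitions by the measure.
scores = {
    (p, q): similarity(petitions[p], petitions[q])
    for p, q in combinations(petitions, 2)
}
for (p, q), s in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{p} ~ {q}: {s:.4f}")

def estimate_population(a, b):
    """For a pair believed to be uncorrelated, |a| * |b| / |a ∩ b|
    (the reciprocal of measure (1)) estimates the total population n."""
    common = len(a & b)
    return (len(a) * len(b)) / common if common else float("inf")
```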
