Skip to content

ENH Remove count DataFrame from calculate_cooks#292

Merged
BorisMuzellec merged 27 commits intoscverse:mainfrom
asistradition:main
Jul 2, 2024
Merged

ENH Remove count DataFrame from calculate_cooks#292
BorisMuzellec merged 27 commits intoscverse:mainfrom
asistradition:main

Conversation

@asistradition
Copy link
Contributor

What does your PR implement? Be specific.

calculate_cooks casts normed_counts into a pandas DataFrame for robust_method_of_moments_disp. This is memory inefficient for large data.

robust_method_of_moments_disp has been refactored to accept an ndarray directly and the DataFrame has been removed. There is no numerical change as a result.

Copy link
Collaborator

@umarteauowkin umarteauowkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @asistradition , thanks a lot for this PR. I was wondering how important it was to you that we put _mu_LFC and _hat_diagonals in the obsm and not in the layers. I agree it makes sense not to store useless nan values. However, fundamentally, these matrices are more layers than simply obsm (we always have this issue between objects restricted to non zero genes and objects defined on all genes, ideallly we would like to have a layers field restricted to non zero genes but we don't). However, if you experience significant memory differences by keeping it in the layers, I will accept !

@asistradition
Copy link
Contributor Author

The main advantage to using .obsm is that column slicing a row-major array requires a copy, and there is a considerable amount of overhead when calling .layers[key][:, filtered_genes] repeatedly.

As those keys are only used in cooks_distances it is a considerable optimization (for large data, e.g. 50k x 30k) to move them to .obsm remove those copies.

@asistradition
Copy link
Contributor Author

Also includes an optional control_genes argument to fit_size_factors which has the same behavior as the controlGenes argument to estimateSizeFactors and the associated unit test

@BorisMuzellec
Copy link
Collaborator

Thanks @asistradition for this PR!

I agree with @umarteauowkin that on principle storing _mu_LFC and _hat_diagonals in the obsm and not in the layers is a bit awkward, but if this leads to memory gains I'm fine with it, given that (as you pointed out) they are only used in cooks_distances.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants