Cleanup stale files#3584
Conversation
```python
wrap.__name__ = function.__name__
return wrap
```
It would be a good idea to note in your PR description why we deleted these!
```python
def _create_stale_file(user, modified_date):
    checksum = '%32x' % random.getrandbits(16 * 8)
```
…(previously test_user.py) for consistency and in response to the user.py test case failure.
rtibbles
left a comment
One important change needed to make sure we delete everything. One other which is more precautionary.
```python
@delay_user_storage_calculation
def clean_up_stale_files(last_modified=now() - datetime.timedelta(days=30)):
```
Might be good to define the default inside the function scope. At the moment, by defining it in the function definition, now() will be whatever now() was at the time the module was imported. So if this function was run from inside a process that had been running for a hundred days, it would actually be looking for files whose last_modified was 130 days ago.
The usual Python pattern for this is to set a default of None, and then do if last_modified is None inside the function and then set it to the default value.
The big gotcha that can sometimes happen here is if the default value is set to a mutable type like a dict, and then successive invocations of the function can actually modify and share the default value.
See here: https://docs.python-guide.org/writing/gotchas/#mutable-default-arguments for more about that.
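The None-default pattern described above would look roughly like this (a sketch; the body is a stand-in, and datetime.datetime.now is used here in place of Django's timezone-aware now()):

```python
import datetime

def clean_up_stale_files(last_modified=None):
    """Delete files not modified since last_modified (default: 30 days ago)."""
    if last_modified is None:
        # Computed at call time, not import time, so long-running processes
        # always get a fresh cutoff. The real code would use
        # django.utils.timezone.now() instead of datetime.datetime.now().
        last_modified = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=30)
    # ... filter and delete File objects older than last_modified ...
    return last_modified
```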
That is a good point that I did not realize, thank you! I will make those changes and update the pr.
```python
logging.info("Cleaning up feature flags")
clean_up_feature_flags()
logging.info("Cleaning up stale file objects")
clean_up_stale_files()
```
This function is only cleaning up 10000 files at a time, so each run of the garbage collection job (which runs daily on production) will only clear up 10000 files - instead we should be running until there are no more files that meet the criteria. One possibility would be to return the number of deleted files, and keep on trying to delete until the number of deleted files is zero.
If we don't do this, it will take us approximately 2 years to clear all the existing 'orphaned' file objects at just 10,000 a day!
@rtibbles It's actually 100k, not 10k. We decided to limit this because we don't know how long deleting the 7 million would take. I think what the actual limit is could be adjusted. We looked at what the cron schedule of the job was in production, and a very conservative value was chosen.
Ah, I assumed that this was intended as a batching parameter to limit the length of any particular delete query. Even with 100000 it will still be 70 days before the backlog is cleared.
Is there a specific concern with the scheduled job taking a long time? It seems like it will be a one time long run and then after that will just be keeping on top of it.
We saw that the cron jobs are scheduled to run every 24 hours. While looking at Django’s delete() method, we noticed that it also triggers the deletion signals, which could take some time to process, along with the memory usage of deleting millions of file objects in one go. The potential for the job to run past 24 hours was our main concern for the limit.
There seem to be two concerns here - one about limiting the memory usage and time of the deletion operation, and one about the run time of the cronjob.
I think we should be limiting the memory usage, and could even reduce the batch size further to remove any concerns.
For the cronjob, I think it is more important that it runs to completion than it finish within 24 hours - as this is parallel to the other deletions that are happening in the cronjob. Without ensuring that all the file objects matching this criteria are deleted, we continue to run the risk of a file table filled with an ever expanding number of 'junk' files.
In terms of concerns about signals, there is one signal that is currently attached to the post_delete signal of the File object, which recalculates file storage:
studio/contentcuration/contentcuration/models.py, line 2232 (commit 4444843)
I don't know that it would deadlock, maybe the subsequent or existing run would end up running superfluous deletes?
At the moment, I think this is plausible when running the KA EN Chef (although I am trying to fix the ricecooker behaviour to prevent this).
Since there is quite a bit of unknown here, it seems like it would be safer to keep a limit, then. How about a limit of 500k, or 1 million?
I am fine with a higher limit - do we still want to batch though to limit memory usage? Presumably for the post_delete signal the full model representation is being loaded, and I'm not sure if Django is batching that or not.
Seems like it doesn't do any batching, but it does check if there are any signal handlers to see whether it needs to load the model representations - this might explain why the other delete methods for garbage collection are explicitly disabling the signal handlers with a context manager.
https://docs.djangoproject.com/en/3.2/ref/models/querysets/#delete
So if we are going to increase the limit, we probably need to disable the signal handlers, rather than just deferring the user storage calculation.
However, it seems like we don't actually need to run the user storage calculation at all, as by definition, these files are not part of any active trees (and the user storage calculation only calculates for active trees).
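One possible shape for a context manager that disables the post_delete handler during the bulk delete (a sketch built around a toy dispatcher so it is self-contained; the real code would disconnect receivers from django.db.models.signals.post_delete for the File sender):

```python
from contextlib import contextmanager

# Minimal stand-in for Django's signal dispatcher; only the connect /
# disconnect / send surface needed for the sketch is implemented.
class Signal:
    def __init__(self):
        self.receivers = []

    def connect(self, receiver):
        self.receivers.append(receiver)

    def disconnect(self, receiver):
        self.receivers.remove(receiver)

    def send(self, instance):
        for receiver in self.receivers:
            receiver(instance)

post_delete = Signal()

@contextmanager
def disable_post_delete_signal(signal, receiver):
    """Disconnect a receiver for the duration of a bulk delete, then restore it."""
    signal.disconnect(receiver)
    try:
        yield
    finally:
        signal.connect(receiver)
```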
Got it, I will update the function to include a higher limit and remove the storage calculation. Do you have any good examples and/or links for batching?
… higher limit and added DisablePostDeleteSignal context manager
rtibbles
left a comment
This looks good - we can see how it behaves on unstable and make further updates if needed!
Summary
Description of the change(s) you made
This PR adds a function that cleans up File objects to the garbage collection utilities. Files with no associated ContentNode, AssessmentItem, or SlideShow that have not been modified since the specified date (defaulting to one month ago if no date is given) will be deleted.
The garbage collect management command is updated to run this function for any files not modified in the last month. This PR also includes a code clean-up of the decorators.py file; the sections deleted from that file were unused.
Reviewer guidance
How can a reviewer test these changes?
Tests for this function have been added to the test_garbage_collect.py file. They cover both the deletion of files that meet the specified requirements (no ContentNode, AssessmentItem, or SlideShow association, and not modified in the last month) and the non-deletion of files that have no such associations but have been modified more recently.
References
Fixes #3533
Contributor's Checklist
PR process:
- Has the CHANGELOG label been added to this PR? Note: items with this label will be added to the CHANGELOG at a later time
- Has the docs label been added if this introduces a change that needs to be updated in the user docs?
- Are requirements.txt files also included in this PR?

Studio-specific:
- Has the notranslate class been added to elements that shouldn't be translated by Google Chrome's automatic translation feature (e.g. icons, user-generated text)?
- Are components in the pages, components, and layouts directories as described in the docs?

Testing:
Reviewer's Checklist
This section is for reviewers to fill out.
(yarn and pip)