Cleanup stale files#3584
Conversation
```python
wrap.__name__ = function.__name__
return wrap
```
It would be a good idea to note in your PR description why we deleted these!
```python
def _create_stale_file(user, modified_date):
    checksum = '%32x' % random.getrandbits(16 * 8)
```
…(previously test_user.py) for consistency and in response to the user.py test case failure.
rtibbles
left a comment
One important change needed to make sure we delete everything. One other which is more precautionary.
```python
@delay_user_storage_calculation
def clean_up_stale_files(last_modified=now() - datetime.timedelta(days=30)):
```
Might be good to define the default inside the function scope. At the moment, by defining it in the function definition, now() will be whatever now() was at the time the module was imported. So if this function was run from inside a process that had been running for a hundred days, it would actually be looking for files whose last_modified was 130 days ago.
The usual Python pattern for this is to set a default of None, and then do if last_modified is None inside the function and then set it to the default value.
The big gotcha that can sometimes happen here is if the default value is set to a mutable type like a dict, and then successive invocations of the function can actually modify and share the default value.
See here: https://docs.python-guide.org/writing/gotchas/#mutable-default-arguments for more about that.
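The None-default pattern described above would look roughly like this (a sketch; the body is a stand-in, and datetime.datetime.now is used here in place of Django's timezone-aware now()):

```python
import datetime

def clean_up_stale_files(last_modified=None):
    """Delete files not modified since last_modified (default: 30 days ago)."""
    if last_modified is None:
        # Computed at call time, not import time, so long-running processes
        # always get a fresh cutoff. The real code would use
        # django.utils.timezone.now() instead of datetime.datetime.now().
        last_modified = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=30)
    # ... filter and delete File objects older than last_modified ...
    return last_modified
```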
That is a good point that I did not realize, thank you! I will make those changes and update the pr.
```python
logging.info("Cleaning up feature flags")
clean_up_feature_flags()
logging.info("Cleaning up stale file objects")
clean_up_stale_files()
```
This function is only cleaning up 10000 files at a time, so each run of the garbage collection job (which runs daily on production) will only clear up 10000 files - instead we should be running until there are no more files that meet the criteria. One possibility would be to return the number of deleted files, and keep on trying to delete until the number of deleted files is zero.
If we don't do this, it will take us approximately 2 years to clear all the existing 'orphaned' file objects at just 10,000 a day!
@rtibbles It's actually 100k, not 10k. We decided to limit this because we don't know how long deleting the 7 million would take. I think what the actual limit is could be adjusted. We looked at what the cron schedule of the job was in production, and a very conservative value was chosen.
Ah, I assumed that this was intended as a batching parameter to limit the length of any particular delete query. Even with 100000 it will still be 70 days before the backlog is cleared.
Is there a specific concern with the scheduled job taking a long time? It seems like it will be a one time long run and then after that will just be keeping on top of it.
We saw that the cron jobs are scheduled to run every 24 hours. While looking at Django’s delete() method, we noticed that it also triggers the deletion signals, which could take some time to process, along with the memory usage of deleting millions of file objects in one go. The potential for the job to run past 24 hours was our main concern for the limit.
There seem to be two concerns here - one about limiting the memory usage and time of the deletion operation, and one about the run time of the cronjob.
I think we should be limiting the memory usage, and could even reduce the batch size further to remove any concerns.
For the cronjob, I think it is more important that it runs to completion than it finish within 24 hours - as this is parallel to the other deletions that are happening in the cronjob. Without ensuring that all the file objects matching this criteria are deleted, we continue to run the risk of a file table filled with an ever expanding number of 'junk' files.
In terms of concerns about signals, there is one signal that is currently attached to the post_delete signal of the File object, which recalculates file storage:
studio/contentcuration/contentcuration/models.py, line 2232 (commit 4444843)
I don't know that it would deadlock, maybe the subsequent or existing run would end up running superfluous deletes?
At the moment, I think this is plausible when running the KA EN Chef (although I am trying to fix the ricecooker behaviour to prevent this).
Since there is quite a bit of unknown here, it seems like it would be safer to keep a limit, then. How about a limit of 500k, or 1 million?
I am fine with a higher limit - do we still want to batch though to limit memory usage? Presumably for the post_delete signal the full model representation is being loaded, and I'm not sure if Django is batching that or not.
Seems like it doesn't do any batching, but it does check if there are any signal handlers to see whether it needs to load the model representations - this might explain why the other delete methods for garbage collection are explicitly disabling the signal handlers with a context manager.
https://docs.djangoproject.com/en/3.2/ref/models/querysets/#delete
So if we are going to increase the limit, we probably need to disable the signal handlers, rather than just deferring the user storage calculation.
However, it seems like we don't actually need to run the user storage calculation at all, as by definition, these files are not part of any active trees (and the user storage calculation only calculates for active trees).
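One possible shape for a context manager that disables the post_delete handler during the bulk delete (a sketch built around a toy dispatcher so it is self-contained; the real code would disconnect receivers from django.db.models.signals.post_delete for the File sender):

```python
from contextlib import contextmanager

# Minimal stand-in for Django's signal dispatcher; only the connect /
# disconnect / send surface needed for the sketch is implemented.
class Signal:
    def __init__(self):
        self.receivers = []

    def connect(self, receiver):
        self.receivers.append(receiver)

    def disconnect(self, receiver):
        self.receivers.remove(receiver)

    def send(self, instance):
        for receiver in self.receivers:
            receiver(instance)

post_delete = Signal()

@contextmanager
def disable_post_delete_signal(signal, receiver):
    """Disconnect a receiver for the duration of a bulk delete, then restore it."""
    signal.disconnect(receiver)
    try:
        yield
    finally:
        signal.connect(receiver)
```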
Got it, I will update the function to include a higher limit and remove the storage calculation. Do you have any good examples and/or links for batching?
… higher limit and added DisablePostDeleteSignal context manager
rtibbles
left a comment
This looks good - we can see how it behaves on unstable and make further updates if needed!
Summary
Description of the change(s) you made
This PR adds a function that cleans up File objects to the garbage collection utilities. Files with no associated ContentNode, AssessmentItem, or SlideShow that have not been modified since the specified date (defaulting to one month ago if no date is given) will be deleted.
The garbage collect management command is updated to run this function for any files not modified in the last month. This PR also includes a code clean-up of the decorators.py file; the sections deleted from that file were unused.
Reviewer guidance
How can a reviewer test these changes?
Tests for this function have been added to the test_garbage_collect.py file. They cover both the deletion of files that meet the specified requirements (no ContentNode, AssessmentItem, or SlideShow association, and not modified in the last month) and the non-deletion of files that have no such associations but have been modified more recently.
References
Fixes #3533
Contributor's Checklist
PR process:
- Has the CHANGELOG label been added to this PR? Note: items with this label will be added to the CHANGELOG at a later time
- Has the docs label been added if this introduces a change that needs to be updated in the user docs?
- Are requirements.txt files also included in this PR?

Studio-specific:
- Has the notranslate class been added to elements that shouldn't be translated by Google Chrome's automatic translation feature (e.g. icons, user-generated text)?
- Are components in the pages, components, and layouts directories as described in the docs?

Testing:
Reviewer's Checklist
This section is for reviewers to fill out.
(yarn and pip)