Hello,
If I'm reading the docs correctly, when we run manage.py update_index, it should be an incremental index of new emails that have been added since the last time update_index was run - can anyone confirm that this is the expected behavior? Or is there another command line option that I am missing?
Right now we are indexing EVERYTHING every time this is run. That probably wouldn't be much of a problem, except that we are importing all of our old archives (they go back to 1998 and comprise over 40.5 million emails) into the new system. As of now we have around 700,000 emails imported into HyperKitty, and since it is indexing everything, each run takes around 1 hour and 10 minutes. Obviously we can't run that once a minute. Right now we're running it once a day, meaning that new emails won't be indexed until the next day.
Fast forward to when we have all 40 million imported...the full index process will take (we estimate) around 4 days or so.
So...can anyone help with how to get the update_index to do only emails added since the last time the update_index was run?
Thanks!
Darren
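For reference, the figures in the message above can be extrapolated with a few lines of Python. This is only a rough sketch: it assumes indexing time grows linearly with the number of emails, which the thread does not state, and the variable names are illustrative. Under that assumption the projection comes out somewhat below the poster's own estimate of around 4 days, but in either case a full index is far too slow to run frequently.

```python
# Linear extrapolation of full-index time from the figures in the message.
# Assumption (not from the thread): indexing time is proportional to the
# number of emails in the archive.

indexed_emails = 700_000      # emails imported so far
minutes_per_run = 70          # 1 hour 10 minutes per full index
total_emails = 40_500_000     # size of the complete archive

projected_minutes = total_emails * minutes_per_run / indexed_emails
projected_days = projected_minutes / (60 * 24)

print(f"{projected_minutes:.0f} minutes, about {projected_days:.1f} days")
# prints "4050 minutes, about 2.8 days"
```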
On 04/03/2018 09:02 AM, Darren Smith wrote:
> Hello,
> If I'm reading the docs correctly, when we run manage.py update_index, it should be an incremental index of new emails that have been added since the last time update_index was run - can anyone confirm that this is the expected behavior? Or is there another command line option that I am missing?
There are two ways to run update_index.
HyperKitty's update_index is incremental and is run by
    manage.py runjob update_index
Running
    manage.py update_index
runs Haystack's update_index, which with no options does a full update. Run
    manage.py update_index --help
to see the options. The incremental 'manage.py runjob update_index' should be run every minute by 'manage.py runjobs minutely', assuming cron is running the periodic Django jobs, so running either update_index job manually shouldn't be necessary.
...
> So...can anyone help with how to get the update_index to do only emails added since the last time the update_index was run?
manage.py runjob update_index
-- 
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
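The difference between a full and an incremental update described in the reply above can be pictured with a small toy model. This is a sketch only, not HyperKitty's or Haystack's code; `archive`, `full_update`, and `incremental_update` are hypothetical names, and a plain list position stands in for whatever bookkeeping the real job uses.

```python
# Toy model only: this is not HyperKitty's or Haystack's code.
# `archive`, `full_update`, and `incremental_update` are hypothetical names.

def full_update(archive):
    """Re-index every message; the work grows with the whole archive."""
    return set(archive)

def incremental_update(index, archive, last_seen):
    """Index only the messages stored after position `last_seen`."""
    new_messages = archive[last_seen:]
    index.update(new_messages)
    # Return how many messages were touched and the new bookmark.
    return len(new_messages), len(archive)

archive = [f"msg-{n}" for n in range(1000)]
index = full_update(archive)              # touches all 1000 messages

archive += ["msg-1000", "msg-1001"]       # two new messages arrive
touched, last_seen = incremental_update(index, archive, last_seen=1000)
print(touched, len(index))                # prints "2 1002"
```

The incremental pass touches only the two new messages, which is why it can plausibly run every minute while a full pass cannot.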
Holy cow I think that's exactly what the problem is. I will give that a try and let you know what happens!
Mailman-users mailing list mailman-users@mailman3.org https://lists.mailman3.org/mailman3/lists/mailman-users.mailman3.org/
OK - next issue...here's the warning that I'm receiving when I run that:
/__init__.py:1451: RuntimeWarning: DateTimeField Email.archived_date received a naive datetime (2018-04-03 07:51:25) while time zone support is active. RuntimeWarning)
I'm not sure that anything else occurred. The indexing took 2 seconds, and there have been tens of thousands of emails imported since the last index. Thoughts?
-Darren
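As background on the warning quoted above: a "naive" datetime is one that carries no time zone information. The sketch below uses only the Python standard library to show the distinction. Treating the timestamp as UTC is an assumption made purely for illustration; the thread does not say which zone the value was meant to be in.

```python
from datetime import datetime, timezone

# The timestamp from the warning, built without any zone information.
naive = datetime(2018, 4, 3, 7, 51, 25)
print(naive.tzinfo)            # prints "None": the value is naive

# Assumption for illustration only: interpret the timestamp as UTC.
aware = naive.replace(tzinfo=timezone.utc)
print(aware.isoformat())       # prints "2018-04-03T07:51:25+00:00"
```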
On 04/03/2018 09:59 AM, Darren Smith wrote:
> OK - next issue...here's the warning that I'm receiving when I run that:
> /__init__.py:1451: RuntimeWarning: DateTimeField Email.archived_date received a naive datetime (2018-04-03 07:51:25) while time zone support is active. RuntimeWarning)
That warning comes from dateutil.parsedate, and I think it's just a warning and shouldn't affect processing.
> I'm not sure that anything else occurred. The indexing took 2 seconds, and there have been tens of thousands of emails imported since the last index. Thoughts?
I'm unsure about what happens when 'runjob update_index' runs after an import. It indexes messages added since the last run, but is "since the last run" based on when the message was added to the archive or on some date in the message itself? In other words, if the added messages are old, do they get indexed by 'runjob update_index'?
You could easily check by doing some searches that should find them and see if they're found.
-- 
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
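The open question in the reply above, whether "since the last run" refers to the time a message was archived or to a date inside the message, can be made concrete with a short sketch. The records, field order, and times below are hypothetical, and the code does not claim to show which rule HyperKitty actually applies; it only shows why the answer matters for imported old mail.

```python
from datetime import datetime, timezone

def utc(*args):
    """Build a timezone-aware UTC datetime."""
    return datetime(*args, tzinfo=timezone.utc)

# Hypothetical records: (message id, date in the message, time it was archived).
messages = [
    ("old-import", utc(1998, 5, 1), utc(2018, 4, 3, 9, 30)),
    ("new-post", utc(2018, 4, 3, 9, 45), utc(2018, 4, 3, 9, 45)),
]
last_run = utc(2018, 4, 3, 9, 0)

# Rule 1: select by the time the message entered the archive.
by_archived = [mid for mid, sent, archived in messages if archived > last_run]
# Rule 2: select by the date carried in the message itself.
by_sent = [mid for mid, sent, archived in messages if sent > last_run]

print(by_archived)   # prints "['old-import', 'new-post']"
print(by_sent)       # prints "['new-post']"
```

Under the second rule a freshly imported 1998 message would be skipped, which is the scenario the suggested search test is meant to rule in or out.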
participants (2): Darren Smith, Mark Sapiro