Importing ListHub data using Laravel
ListHub is probably the closest thing there is to a nationwide MLS feed, with about 1.4 million listings. The bad news is that ListHub doesn't offer a Web API; instead, the national feed is a 14 GB XML file. This presents some pretty big challenges if you're trying to import listings. Fortunately, I'm going to give you a high-level overview of how we do things at Rets Rabbit to give our ListHub customers a Web API.
Aside from dealing with a very large file (14 GB!) and processing over a million records, there's also the problem of efficiently parsing updates and removing expired listings. ListHub updates the file 3-4 times a day, but unlike RETS, it doesn't provide diffs; there's only one file that contains all of the listings, and you have to parse the whole thing every time you run an update. So your import process needs to be fast, which means you need to design things to run in parallel as much as possible.
At Rets Rabbit, we built everything on a modern PHP framework called Laravel, which ships with a number of tools that make processing large jobs easier. Here's a breakdown of the components in our import process:
- An automated task to download the ListHub (LH) file on a schedule
- A fast XML streamer library
- A queue that lets parallel jobs process the feed in segments
- Import job code that reads X lines and quickly updates or adds listings
- A Redis cache to record "active" listings as the LH feed is scanned; listings that can't be looked up in the cache are marked as expired
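For the first component, Laravel's task scheduler can kick off the download on a fixed interval. Here's a minimal sketch, assuming a hypothetical `listhub:download` Artisan command that fetches the feed file (the command name is illustrative, not our actual code):

```php
<?php
// app/Console/Kernel.php — register the download with the scheduler.
// "listhub:download" is a hypothetical Artisan command that would
// fetch the 14 GB LH feed file to local storage.
protected function schedule(Schedule $schedule)
{
    // ListHub refreshes the feed 3-4 times a day, so every 6 hours
    // keeps us roughly in step without re-downloading needlessly.
    $schedule->command('listhub:download')->cron('0 */6 * * *');
}
```

The cron expression keeps this portable across Laravel versions; newer releases also offer readable helpers like `everySixHours()`.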
Quickly reading a very large XML file
When I first learned I'd have to process a 14 GB XML file, one obvious problem was that PHP's standard method of reading XML, simplexml_load_file, would not work for this use case because it loads the entire XML file into memory. What I needed was a way to stream the XML file one node at a time. PHP has a number of XML streaming options, but they all seemed pretty complex. I wanted something that gave me the speed but was also easy to fit into our existing import process.
After some research I found a great XML streamer at https://github.com/prewk/xml-string-streamer. It has the advantage of being nearly as easy to use as simplexml_load_file, and it's fast: using createUniqueNodeParser, it can process 10K lines in a couple of minutes.
The streamer returns each XML node as a string, which can then be piped into PHP's simplexml_load_string function to parse the node into a SimpleXMLElement object. This fit pretty well into our existing RETS import process.
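To make that concrete, here's a self-contained sketch of the per-node parsing step. The element and field names (`Listing`, `ListingKey`, `ModificationTimestamp`) are illustrative, not necessarily ListHub's actual schema:

```php
<?php
// A sample node string — in production, each of these comes out of the
// XML streamer one <Listing> element at a time.
$nodeString = <<<XML
<Listing>
  <ListingKey>MLS123456</ListingKey>
  <ModificationTimestamp>2016-05-01T12:30:00Z</ModificationTimestamp>
</Listing>
XML;

// Parse the single node into a SimpleXMLElement and pull out the
// fields the import job needs.
$listing = simplexml_load_string($nodeString);

$mlsNumber = (string) $listing->ListingKey;
$modified  = (string) $listing->ModificationTimestamp;
```

In the real import loop, each node string would come from something like `$streamer->getNode()` on a parser built with `XmlStringStreamer::createUniqueNodeParser($file, ['uniqueNode' => 'Listing'])` — check the library's README for the exact options.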
Parallel Queued Jobs
Laravel supports asynchronous tasks through its queue service. Using Laravel queues with a distributed queue provider such as Amazon SQS, you can break up large jobs (such as importing 1.4 million listings and their images) into smaller tasks that run in parallel. This is how we at Rets Rabbit can process over a quarter million listings every day.
At Rets Rabbit we run four separate queues: listing, images, images_fresh, and default.
Every three to four hours, jobs are scheduled on the listing queue that break the main LH file into 200K-line segments. A worker process (we usually run 5-6 at once) pulls a job off the queue and reads its segment of the LH file. For each line, the worker tries to retrieve a matching listing by MLS#.
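The segmentation step itself is just arithmetic over the file's line count. Here's a minimal sketch (the job class name in the comment is hypothetical):

```php
<?php
// Split a known total line count into fixed-size segments, one queued
// job per segment. Numbers are illustrative.
function makeSegments(int $totalLines, int $segmentSize): array
{
    $segments = [];
    for ($start = 0; $start < $totalLines; $start += $segmentSize) {
        $end = min($start + $segmentSize, $totalLines) - 1;
        $segments[] = [$start, $end]; // inclusive line range for one job
    }
    return $segments;
}

// e.g. 1.4M lines in 200K-line chunks yields 7 parallel jobs
$segments = makeSegments(1400000, 200000);

// Each segment would then be dispatched to the listing queue, e.g.
// dispatch(new ImportSegmentJob($start, $end))->onQueue('listing');
// (ImportSegmentJob is a hypothetical job class, not Rets Rabbit's code.)
```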
If the listing is found in our database, the worker compares the last time the listing was updated in our DB against the modified date in the XML file. If the modified date is later than the updated date in our DB, the listing is updated, and another job is put on the images queue for a separate worker to import images.
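The update decision boils down to a timestamp comparison. A sketch, with illustrative function and parameter names:

```php
<?php
// Decide whether a listing needs updating: has the feed's modification
// timestamp moved past the last update we recorded in our DB?
function needsUpdate(string $feedModified, string $dbUpdatedAt): bool
{
    return strtotime($feedModified) > strtotime($dbUpdatedAt);
}

// Feed node modified after our last import → update (and queue images)
$shouldUpdate = needsUpdate('2016-05-02T08:00:00Z', '2016-05-01T12:00:00Z');
```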
By using queues and parallel jobs, we can process 30-50 listings simultaneously, which allows us to scan all 1.4M listings. Since ListHub doesn't update every listing daily (usually it updates 100K-200K), things move pretty fast after the initial import. You can see from the screenshot below that one of our recent jobs processed 200K listings in 12 minutes.
Each time we scan the LH file during an update, the MLS# is added to a Redis cache. Redis is a very fast caching system that also lets us "expire" keys after a period of time, which is convenient for cleaning up listings that have been removed from the LH feed. Every night, a job is added to the default queue that loops through our database and checks the cache for each stored MLS#.
If the MLS# can't be found in our cache, the listing is marked inactive. This allows us to quickly expire listings without having to process the XML file again.
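Here's the shape of that nightly sweep. A plain array stands in for Redis so the example is self-contained; in production the cache entries would be Redis keys written with a TTL (e.g. SETEX) during the file scan, and the names here are illustrative:

```php
<?php
// Nightly sweep: any MLS# in our DB that wasn't recorded during the
// latest LH scan is considered expired and should be marked inactive.
function findExpired(array $dbMlsNumbers, array $activeCache): array
{
    $expired = [];
    foreach ($dbMlsNumbers as $mls) {
        if (!isset($activeCache[$mls])) {
            $expired[] = $mls; // not seen in the last scan
        }
    }
    return $expired;
}

// Listings in our DB vs. MLS#s recorded during the latest scan
$inDb   = ['MLS001', 'MLS002', 'MLS003'];
$active = ['MLS001' => true, 'MLS003' => true];

$toDeactivate = findExpired($inDb, $active);
```

Letting Redis expire the keys automatically means a listing that drops out of the feed simply stops being refreshed, and the next sweep catches it.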
Parsing ListHub presents its own challenges compared to importing through RETS. By breaking the file into segments that can be processed in parallel and using a fast XML streamer, you can build a system that keeps your app importing LH quickly and reliably.
You can also just forget all of that and use Rets Rabbit instead ;). We'll handle the import and give you a nice REST API that you can use to build great property searches into your web or mobile application. Email me at firstname.lastname@example.org and I'll answer any questions about our system.