This data set is not exactly huge by any measure, but it is somewhat complex, and importing it into the Datastore involves building various aggregations to keep our system light and fast.
We basically have three datasets:
- Country names - A simple set with 250 entries, linking each ISO code to the respective country name in English. Importing this set is a breeze!
- World Locations - All cities, regions and locations of the world, with latitude and longitude as well as zip code and state/region names. This set is 316,848 entries. For every entry, we query the country dataset and retrieve the country name.
One important detail: we don't link the location table to the country table. We fetch the country name and save it into the location table itself, which saves a lot of unnecessary future requests to the country table.
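The denormalization step can be sketched like this, with plain dicts standing in for the actual Datastore entities (the field names are illustrative, not the real schema):

```python
# Country lookup table: iso code -> name (250 entries in the real set).
countries = {"BR": "Brazil", "US": "United States"}

def build_location_entity(row, countries):
    """Copy the country name into the location record itself, so later
    reads of a location never need a second query to the country table."""
    return {
        "city": row["city"],
        "lat": row["lat"],
        "lng": row["lng"],
        "country_code": row["country_code"],
        # Denormalized: stored redundantly to avoid a future lookup.
        "country_name": countries[row["country_code"]],
    }
```

The trade-off is classic denormalization: country names almost never change, so paying a little extra storage per location buys us one less query on every read.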
To allow for easy searching of locations, we create another aggregation that stores slices and references of each location in an easily queryable form.
Another important bit is aggregating locations by proximity. We need to know which regions are nearby, because querying a dataset of 316k+ entries every time we need to answer a query for a region is impractical.
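One way such a proximity aggregation can work (a sketch, not necessarily the exact scheme we use): snap each point to a coarse lat/lng grid cell and store the cell key on the entity, so "nearby" becomes a handful of key lookups instead of a 316k-row scan. The cell size here is an arbitrary assumption.

```python
import math

def grid_cell(lat, lng, cell_deg=0.5):
    """Key of the grid cell containing (lat, lng), at cell_deg resolution."""
    return (math.floor(lat / cell_deg), math.floor(lng / cell_deg))

def neighbor_cells(lat, lng, cell_deg=0.5):
    """Keys of the point's cell and its 8 neighbors -- the candidate set
    to fetch when answering a proximity query."""
    cx, cy = grid_cell(lat, lng, cell_deg)
    return [(cx + dx, cy + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
```

At import time each location gets indexed under its cell key; at query time we fetch only the 9 surrounding cells.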
Importing each entry and building all the aggregations takes an average of 2 seconds (an import rate of 30 per minute), and at that rate it would take about 176 hours (about 7 days) to finish.
- IPv4 Blocks - A large dataset with over 3 million entries, linking IPv4 ranges to locations. With this dataset, we can tell an IP's origin. For each entry here, we query the locations dataset and save the location data into the IPv4 ranges table, so we avoid unnecessary queries in the future. The processing rate is around 40 entries per minute, and the estimated time to finish is about 1,538 hours, or about 64 days of continuous processing.
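Once the ranges are imported sorted by their start address, the IP-to-location lookup itself is cheap. A minimal sketch of the idea (field names and the in-memory lists are assumptions; in production this would be a Datastore query):

```python
import bisect

def ip_to_int(ip):
    """Convert a dotted-quad IPv4 string to its 32-bit integer form."""
    a, b, c, d = (int(part) for part in ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def find_location(starts, blocks, ip):
    """starts: sorted list of range-start ints; blocks: parallel list of
    (start, end, location_data) tuples. Returns the matching location."""
    n = ip_to_int(ip)
    # Rightmost range starting at or before n is the only candidate.
    i = bisect.bisect_right(starts, n) - 1
    if i >= 0 and starts[i] <= n <= blocks[i][1]:
        return blocks[i][2]
    return None
```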
So the total time to import all this data would be around 71 days of continuous processing!
My first attempt used regular frontend instances for importing the data. I wrote a view that would accept a JSON payload containing the data and create the entries in the Datastore. The problem with that approach is that you need a script running on your local machine, you have to send entries in batches to save time on latency, and each batch has to fit within the 30-second request limit or you risk losing data.
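The client side of that approach boils down to a batching loop like the one below. The batch size and endpoint are assumptions; the size has to be tuned so one POST reliably finishes inside the 30-second deadline.

```python
def batches(rows, size=100):
    """Yield fixed-size chunks of rows, each to be POSTed as one JSON
    payload to the import view. The size must be small enough that the
    server can process a whole chunk within the request deadline."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

# Hypothetical usage from the local machine:
# for chunk in batches(all_rows):
#     urllib.request.urlopen(IMPORT_URL, json.dumps(chunk).encode())
```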
I managed to put this script to work, and it was doing great. But sometimes a given batch would take more than 30 seconds and break the import, so I had to babysit the script all the time and watch the logs. Soon I figured it would take too long to continue with this approach, and decided it was time to try something else.
Google provides App Engine users with backend instances, which have no response time limitations and can work as dedicated machines for processing large amounts of data like ours. There's no need to have a script running on my machine, and there's more room to push the limits (the urlfetch deadline, for example).
Soon I found that not everything is possible there, though. The first frustration was discovering that urlfetch can't download anything larger than 32 MB, and I obviously needed to download a CSV file much larger than that in order to parse the data.
Another limitation is the 10 GB total memory limit for backends. My script is very slim (each instance consumes only 50 MB), but what counts against that limit is the memory the backend class allocates for you, not what you actually use, which means that if you pick a larger instance class, the number of instances you can run at the same time drops drastically.
For example: I can run 80 B1 instances, 40 B2 instances, 20 B4 instances or 10 B8 instances. If your script needs to crunch a huge dataset and consumes a lot of memory, you're stuck with that trade-off.
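Those counts fall straight out of the 10 GB budget divided by each class's allocation (B1 is 128 MB, doubling with each class):

```python
TOTAL_MB = 10 * 1024  # the 10 GB backend memory budget, in MB

def max_instances(class_mb):
    """How many instances of a given class fit in the total budget."""
    return TOTAL_MB // class_mb

for name, mb in [("B1", 128), ("B2", 256), ("B4", 512), ("B8", 1024)]:
    print(name, max_instances(mb))  # B1 80, B2 40, B4 20, B8 10
```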
Fortunately, this isn't always a blocker, and you can try combining backends with tasks and frontend instances if you want to go beyond these limits. You can also ask App Engine support for an increase, though I'm not sure whether this limit is technical or whether they can easily grant more instances at will.
I went ahead and decided to use 60 instances to crunch the data. The outcome: the same final cost (in the end, you're processing the same amount of data, so the same number of instance hours) but much faster: 3 hours for the locations dataset and almost 3 days for the IP blocks dataset. Much better!
To work around the 32 MB limitation, I created a simple Python CGI script, hosted on my Dreamhost account, that serves slices of the file on demand to each instance. Now each instance downloads only about 2 MB of data, and the much lower memory footprint lets me run B1 instances.
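The core of such a slicing script is just a seek-and-read over the big file; the parameter names here are assumptions, and the CGI plumbing around it is omitted:

```python
def read_slice(path, offset, length):
    """Return `length` bytes of the file starting at byte `offset`,
    so each backend instance fetches only its own ~2 MB share."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```

One caveat with raw byte offsets: a slice boundary can land mid-row in a CSV, so in practice the server or the client has to discard the partial first line and read through to the end of the last one.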
We're also using memcache heavily to reduce the number of Datastore API requests (which cuts costs as well as speeding up processing).
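The pattern is plain read-through caching: check the cache first, and only fall back to the Datastore on a miss. Sketched here with a dict standing in for App Engine's memcache API:

```python
def get_cached(key, cache, fetch_from_datastore):
    """Return the value for key, hitting the Datastore only on a cache
    miss and populating the cache for subsequent reads."""
    value = cache.get(key)
    if value is None:
        value = fetch_from_datastore(key)  # the expensive Datastore read
        cache[key] = value
    return value
```

During the import this pays off enormously, since the same countries and locations are looked up over and over.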
All in all, I'm happy with backends and with how easy it is to spin up 60 instances to crunch whatever amount of data you need. If I had to do that anywhere else, it would be much harder and much more expensive.