I am currently working on an application where I want to have a search feature that allows people to search for businesses within a certain distance of their home (or anywhere else they care to choose).
I have some old UK postcode data knocking around, and was going to use that. For those not familiar with them, UK postcodes are made up of two parts, a major (also known as “outward”) part, and a minor (or “inward”) part. The major part is one or two letters followed by one or two digits, and the minor part is a digit, followed by two letters. Examples of valid postcode formats are M25 0LE, NW11 3ER and L2 3WE (no idea if these are genuine postcodes though).
Coupled with the postcodes are northings and eastings. These odd-sounding beasties are simply the number of metres north and east from a designated origin, which is somewhere west of Lands End. See the Ordinance Survey web site for more details. If you have any two postcodes, you can calculate the distance between them (as the crow flies) by a simple application of Pythagoras’ Theorem. My intention was to use all of this in my search code.
Whilst doing some ferreting around the web, looking for more up-to-date data, I found out that you can now get the UK postcode data absolutely free! When I last looked, which was some years ago admittedly, they charged an arm and a leg for this. Now all you need to do is order it from their downloads page, and you get sent a link to download the data. They have all sorts of interesting data sets there (including all sorts of maps, street views and so on), but the one I wanted was the Code-Point Open set. This was far more up-to-date than the data I had, and was a welcome discovery. Now all I had to do was import it, and write some calculation code.
Before I got any further though, I further discovered that SQL Server 2008 has a new data type known as geography data, which is used to represent real-world (ie represented on a round Earth) co-ordinates. It also has the geometry data type, which is a similar idea, but uses a Euclidean (ie flat) co-ordinate system. Given that I am only dealing with the UK, the curvature of the Earth isn’t significant in distance calculations, and so either data type would do.
However, converting northings and eastings to geography data isn’t as simple as you might hope, but I found a post on Adrian Hill’s blog, where he described how to convert OS data into geometry data, provided a C# console program that does just that, and then showed how easy it is to query the data for distances between points. In other words, exactly what I wanted!
I won’t describe the details here, because you can follow that link and read it for yourself, but basically all you need to do is get the Open-Point data, download the converter and away you go. Truth be told, it wasn’t quite that simple as the format of the data has changed slightly since he wrote the code, so I needed to make a small change. I left a comment on the post, and Adrian updated the code, so you shouldn’t need to worry about that. It doesn’t use hard-coded column numbers anymore, so should be future-proof, in case the OS people change the format again.
Once I had the data in SQL Server, I wanted to see how to use it. In the blog post, Adrian showed a simple piece of SQL that did a sample query. You can see the full code on his blog, but the main part of it was a call to an SQL Server function named GeoLocation.STDistance() that did the calculation. He commented that when he tested it, he searched for postcodes within five miles of his home, and got 8354 rows returned in around 500ms. I was somewhat surprised when I tried it to discover that it took around 14 seconds to do a similar calculation! Not exactly what you’d call snappy, and certainly too slow for a (hopefully) busy application.
I was a bit disappointed with this, but decided that for my purposes, it would be accurate enough to do the calculation based on the major part of the postcode only. One of the things Adrian’s code did when importing the data to SQL Server was create rows where the minor part was blank, and the geography value was an average for the whole postcode major area. I adjusted my SQL to do this, and I got results quickly enough that the status bar in SQL Server 2008 Management Studio reported them as zero. OK, so the results won’t be quite as accurate, but at least they will be fast.
Whilst I was writing this blog post, I looked back at Adrian’s description again, and noticed one small bullet point that I had missed. When describing what his code did, he mentioned that it created a spatial index on the geography column. Huh? What on earth (pardon the pun) is a spatial index. No, I had no idea either, so off I went to MSDN, and found an article describing how to create a spatial index. I created one, which took a few minutes (presumably because there were over 17 million postcodes in the table), and tried again. This time, I got the performance that Adrian reported. I don’t know if his code really should have created the spatial index, but it didn’t. However, once it was created, the execution speed was fast enough for me to use a full postcode search in my application.
So, all armed with working code, my next job was to code the search in my application. This seemed simple enough, just update the Entity Framework model with the new format of the postcodes table, write some Linq to do the query and we’re done… or not! Sadly, Entity Framework doesn’t support the geography data type, so it looked like I couldn’t use my new data! This was a let-down to say the least. Still, not to be put off, I went off and did some more research, and realised that it was time to learn how to use stored procedures with Entity Framework. I’d never done this before, simply because when I discovered Entity Framework, I was so excited by it that I gave up writing SQL altogether. All my data access code went into the repository classes, and was written in Linq.
Creating a stored procedure was pretty easy, and was based on Adrian’s sample SQL:
create procedure GetPostcodesWithinDistance @OutwardCode varchar(4), @InwardCode varchar(3), @Miles int as begin declare @home geography select @home = GeoLocation from PostCodeData where OutwardCode = @OutwardCode and InwardCode = @InwardCode select OutwardCode, InwardCode from dbo.PostCodeData where GeoLocation.STDistance(@home) <= (@Miles * 1609) and InwardCode <> '' end
Having coded the stored procedure, I now had to bring it into the Entity Framework model, and work out how to use it. The first part seemed straightforward enough (doesn’t it always?). You just refresh your model, open the “Stored Procedures” node in the tree on the Add tab, select the stored procedure, click OK and you’re done. Or not. The slight problem was that my fantastic new stored procedure wasn’t actually listed in the tree on the Add tab. It was a fairly simple case of non-presence (Vic wasn’t there either).
After some frustration, someone pointed out to me that it was probably due to the database user not having execute permission on the stored procedure. Always gets me that one! Once I had granted execute permission, the stored procedure appeared (although Vic still wasn’t there), and I was able to bring it into the model. Then I right-clicked the design surface, chose Add –> Function Import and added the stored procedure as a function in the model context. Finally I had access to the code in my model, and could begin to write the search.
Just to ensure that I didn’t get too confident, Microsoft threw another curve-ball at me at this point. My first attempt at the query looked like this:
IEnumerable<string> majors = from p in jbd.GetPostcodeWithinDistance("M7", 45) IQueryable<Business> localBusinesses = from b in ObjSet where (from p in majors where p.Major == b.PostcodeMajor select p).Count() > 0 select b;
The first query grabs all the postcodes within 45 miles of the average location of the major postcode M7.Bear in mind that this query was written before I discovered the spatial index, so it only uses the major part of the postcode. A later revision uses the full postcode. The variable jbd is the context, and ObjSet is an ObjectSet<Business> which is created by my base generic repository.
When this query was executed, I got a delightfully descriptive exception, “Unable to create a constant value of type ‘JBD.Entities.Postcode’. Only primitive types (‘such as Int32, String, and Guid’) are supported in this context.” Somewhat baffled, I turned the Ultimate Source Of All Baffling Problems known as Google, and discovered that this wasn’t actually the best way to code the query, even if it had worked. I had forgotten about the Contains() extension method, which was written specifically for cases like this.
The revised code looked like this (first query omitted as it didn’t change):
IQueryable<Business> data = from b in ObjSet where majors.Contains(b.PostcodeMajor) select b;
Somewhat neater, clearer and (more to the point) working!
So, with working code, I now only had more hurdle to jump before I could claim this one had been cracked.
All of the above was going on on my development machine. Now I had working code, I wanted to put the data on the production server, so that the application could use it. This turned out to be harder than I thought. I tried the SQL Server Import/Export wizard, which usually works cleanly enough, but got an error telling me that it couldn’t convert geography data to geography data. Huh? Searching around for advice, I found someone who suggested trying to do the copy as NTEXT, NIMAGE, and various other data types, but none of these worked either.
After some more searching, I discovered an article describing it, that says that SQL Server has an XML file that contains definitions for how to convert various data types. Unfortunately, Microsoft seem to have forgotten to include the geography and geometry data types in there! I found some code to copy and paste and, lo and behold, it gave the same error message! Ho hum.
After some more messing around and frustrated attempts, I mentioned the problem to a friend of mine who came up with the rather simple suggestion of backing up the development database and attaching it to the SQL Server instance on the production machine. I had thought of this, but didn’t want to do it as the production database has live data in it, which would be lost. He pointed out to me that if I attached it under a different name, I could then copy the data from the newly-attached database to the production one (which would be very easy now that they were both on the same instance of SQL Server), and then delete the copy of the development database.
Thankfully, this all worked fine, and I finally have a working geographic search. As usual, it was a rough and frustrating ride, but I learned quite a few new things along the way.