Not too long ago, I was faced with the task of exporting one of my customers' data into yet another quasi-governmental format. Among other things, the upload had to provide the postal addresses of individual clients in a structured form: zip code, region, district, and so on, down to the apartment number.
That would all be fine, except that the original client addresses were stored as a plain free-text string like "Kitezhgrad, st.Volshebnaya 22 house kvartir.15". In other words, nobody had ever heard of zip codes, while the free-text entry field offered plenty of room for self-expression and folk art.
I went straight to the net to look for a solution, reasoning that the situation must be extremely common and someone had surely conquered it already. And so it turned out, except that instead of sources or even compiled libraries, all that greeted me were online services offering to parse postal addresses for real money (the lowest price I could find was 10 kopecks per address).
Since I had no desire to hand profit over to some external organization, and since by that point a certain sporting interest had set in, I wanted to solve the problem myself, with as little effort as possible. The task was made easier by the fact that the client did not need high parsing accuracy: no kind of error would lead to fatal consequences.
To begin with, I looked at the Tomita parser, but after seeing the multi-page config example that determines from a text which city someone lives in ( http://api.yandex.ru/tomita/doc/dg/concept/example.xml ), my optimism was somewhat diminished, while the urge to roll my own bicycle only grew stronger.
Naturally, with some rather tight restrictions on the input data, under which the rest of this research proceeds:
- The address is always written without typos: "Arfografia Avenue" remains on the conscience of whoever typed it.
- The address is written from the most general element (region) down to the most specific (apartment number).
- Since the ordering already carries the structure, we simply discard words like "district", "street", "avenue", and "house". So if a city has both an avenue named after the Teletubbies and a street named after them, we won't be able to tell them apart. Given the rarity of such a situation and our right to make mistakes, that is an acceptable trade-off.
Then I went looking for a data source from which I could get postal code information. As it turned out, KLADR is yesterday's news these days; FIAS rules ( http://fias.nalog.ru ). Having downloaded an offline copy of this database, I began exploring what it offers.
I was especially interested in two tables: ADDROBJ, which contains a tree-like list of all address objects, from a federal subject of Russia down to a street, and HOUSE<region>, where house numbers are stored with references to ADDROBJ records, along with their postal codes. The information in these two tables suffices for both of our purposes: verifying that an address was parsed correctly (if we can find the address in the database, we recognized it correctly) and determining the postal code.
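To make the idea concrete, here is a minimal sketch of how the two tables can answer both questions at once. The column names (AOGUID, PARENTGUID, FORMALNAME, HOUSENUM, POSTALCODE) follow the FIAS layout, but the in-memory SQLite database and its contents below are a toy stand-in of my own, not the real FIAS or the original MS SQL setup:

```python
import sqlite3

# Toy stand-in for the FIAS tables: ADDROBJ is a tree keyed by AOGUID/PARENTGUID,
# HOUSE38 attaches house numbers and postal codes to ADDROBJ records.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ADDROBJ (AOGUID TEXT, PARENTGUID TEXT, FORMALNAME TEXT);
CREATE TABLE HOUSE38 (AOGUID TEXT, HOUSENUM TEXT, POSTALCODE TEXT);
INSERT INTO ADDROBJ VALUES ('city-1', NULL, 'Kitezhgrad');
INSERT INTO ADDROBJ VALUES ('street-1', 'city-1', 'Volshebnaya');
INSERT INTO HOUSE38 VALUES ('street-1', '22', '666999');
""")

def find_postal_code(elements):
    """Walk the element chain down the ADDROBJ tree; if every link exists,
    look the last element up as a house number and return its postal code.
    Returns None when the chain breaks, i.e. the parse variant is rejected."""
    parent = None
    for name in elements[:-1]:
        row = conn.execute(
            "SELECT AOGUID FROM ADDROBJ WHERE FORMALNAME = ? AND PARENTGUID IS ?",
            (name, parent)).fetchone()
        if row is None:
            return None
        parent = row[0]
    row = conn.execute(
        "SELECT POSTALCODE FROM HOUSE38 WHERE AOGUID = ? AND HOUSENUM = ?",
        (parent, elements[-1])).fetchone()
    return row[0] if row else None
```

A chain that resolves all the way down yields a postal code; any miss along the way doubles as a "this parse is wrong" signal.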
An algorithm began to take shape in my head:
- Split the postal address string into address elements. By an address element I mean anything that can appear as a FIAS table row: district, city, street, house, or apartment number.
- Unconditional separators for address elements include dots, commas, semicolons, and slashes.
- Conditional separators include hyphens and dashes if the address element after the hyphen is a number. For example in "38a-117 Depression Lane" a hyphen is a delimiter, but in "Ust-Zazhopinsk" it is not.
- A space may or may not be a separator. In "8 March, d.15" the space between "8" and "March" obviously should not split the elements, while the one between "March" and "d." should. The simplest head-on solution is to generate every possible way of splitting the address elements on spaces and continue the algorithm with each variant separately.
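The splitting steps above can be sketched roughly as follows. This is my own illustrative reconstruction, not the project's actual code; function names and the exact regexes are assumptions:

```python
import re

def split_elements(address: str) -> list[str]:
    """Split on the unconditional separators (dot, comma, semicolon, slash)
    and on a hyphen/dash only when it is followed by a digit."""
    # Conditional separator: hyphen or dash becomes a comma before a number,
    # so "38a-117" splits but "Ust-Zazhopinsk" survives intact.
    prepared = re.sub(r"[-\u2013\u2014](?=\d)", ",", address)
    parts = re.split(r"[.,;/]", prepared)
    return [p.strip() for p in parts if p.strip()]

def space_split_variants(tokens: list[str]) -> list[list[str]]:
    """Enumerate every way of grouping adjacent space-separated tokens into
    elements; the parser then tries each variant against the database."""
    if not tokens:
        return [[]]
    result = []
    for i in range(1, len(tokens) + 1):
        head = " ".join(tokens[:i])
        for tail in space_split_variants(tokens[i:]):
            result.append([head] + tail)
    return result
```

For n tokens this produces 2^(n-1) variants, which is exactly the combinatorial blow-up discussed below in the performance notes.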
The algorithm is empirical and naive and does not cover every possible situation… But given the constraints on implementation time and the generous tolerance for errors, it looked quite satisfactory. Designing and implementing it took about one night.
For convenience, I converted the ADDROBJ and HOUSEXX tables from DBF to MS SQL (on how to do that painlessly, I read here: http://blogs.technet.com/b/isv_team/archive/2012/05/14/3497825.aspx ).
As a result, we have an AddressParser class that takes an address string as input and returns an instance of the Address class. You can feed your own implementation of IKnwonAddressComparator into the AddressParser constructor if the current implementation, tailored for MS SQL, doesn't suit you for some reason.
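The shape of that API, rendered as a Python sketch (the real sources are on the download link below and are presumably .NET, so every name here is a transliteration of my own, not a copy):

```python
from dataclasses import dataclass
from typing import Optional, Protocol

@dataclass
class Address:
    """Structured parse result; any level may be missing."""
    region: Optional[str] = None
    city: Optional[str] = None
    street: Optional[str] = None
    house: Optional[str] = None
    flat: Optional[str] = None
    postal_code: Optional[str] = None

class KnownAddressComparator(Protocol):
    """Answers whether an element chain exists in the address database;
    the stock implementation talks to MS SQL, but any backend would do."""
    def exists(self, elements: list[str]) -> bool: ...

class AddressParser:
    def __init__(self, comparator: KnownAddressComparator):
        self.comparator = comparator

    def parse(self, raw: str) -> Optional[Address]:
        # Generate split variants and accept the first one the
        # comparator confirms; returns None when nothing matches.
        ...
```

Swapping the comparator is what decouples the splitting heuristics from the storage backend.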
The parsing speed is around 2-5 addresses per second. That is bad, but still better than doing it by hand. The main problem is the sheer number of variants to check, generated by the space-splitting step. Ideally, that step should be rewritten to consult the address database already while splitting, checking whether candidate address elements exist. As an intermediate option, the number of variants can simply be capped.
On a random sample of real data, parsing quality came out at 62%. For this evaluation an address counted either as fully recognized (all the way down to the apartment) or not. No halftones.
The error distribution is as follows:
- 37% – typos. As a rule, trivial omissions or additions of letters: "Kirov", "Moskv"…
- 21% – use of abbreviations: "K. Marx", "R. Luxemburg"…
- 42% – houses missing from the FIAS database, making it impossible to determine the postal code and fill in the whole chain. A rather unexpected cause for me, although many write that FIAS is still too raw for industrial use.
What conclusions can be drawn?
If, like me, you can live with low parsing quality and low speed – you can use it.
If you need significantly higher recognition accuracy, you can try introducing fuzzy search. Adding a list of popular abbreviations would also raise the percentage of successfully recognized addresses.
Performance is a separate story, which, given the simplistic and unoptimized implementation, can also be addressed. The first candidate here is smarter space splitting.
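Both improvements are straightforward to prototype. Below is a sketch of edit-distance matching for the typo bucket and a lookup table for the abbreviation bucket; the distance threshold and the table entries are my own guesses, not part of the original project:

```python
# Illustrative abbreviation table; real deployments would need a much
# longer, curated list.
ABBREVIATIONS = {"K.": "Karl", "R.": "Rosa"}

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_match(name: str, known_names: list[str], max_dist: int = 1):
    """Return the closest known name within max_dist edits, else None;
    this is where "Moskv" would be rescued as "Moskva"."""
    best = min(known_names, key=lambda k: levenshtein(name, k))
    return best if levenshtein(name, best) <= max_dist else None
```

With max_dist = 1, trivial one-letter omissions like "Moskv" are recovered while genuinely different names still fail, which keeps false positives in check.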
But all that is another story.
The source codes can be downloaded from here : https://yadi.sk/d/muzi9b6qZ8DWh
A test MS SQL database with houses for regions 38 and 78 can be taken here : https://yadi.sk/d/ERXyDXv7Z8Dab