Bread Crumbs

There are a number of projects that are trying to set up open databases of cell tower and WiFi Access Point (AP) locations. These are useful to me because I normally run with data off. By using on-phone copies of (a subset of) their data to estimate my location, my GPS can get a much quicker “time to first fix”.

These projects use volunteers who install “network stumbling” apps on their smartphones to collect the data. Since I am using their data, I would like to contribute.

There are a couple of privacy concerns that cause me to hesitate.

  • All the WiFi and cell tower collection software I’ve looked at seems to collect new samples in areas where you have already been. These samples are uploaded from your phone and then deleted from it. Once they are deleted, the phone cannot detect that it is collecting data in the same places as it did earlier, so it will collect them again. The result is that there will be many more samples collected near your home, your work, favorite shops or other places you hang out than there will be elsewhere.
  • All the projects are fairly opaque about how they use and retain the data they collect. Often it is unclear whether the project is gathering data that can tie your submissions to you or your phone. While there are arguments on both sides about keeping some indicator of the source of the data, it would be nice to know what each project does.

If there were enough volunteers there might be safety in numbers. If several people in your neighborhood are contributing, there is a reasonable chance that the data you submit will be harder to detect and isolate, assuming that the project does a reasonable job of anonymizing contributions.

Open WLAN Map is one of the very few WiFi AP mapping sites that allow bulk download of some of their data (contributors can decline to have their data published in the bulk download file). As of the time of this writing they seem to have 246 contributors worldwide. If you contribute to them you are likely to be the only contributor for many miles around, and there will be no safety in numbers. Your house and work might be hard to detect in the final output they provide, but they will definitely be visible in the raw data.

This has led me to think about how data could be collected in a way that reduces the “bread crumb” trail that could identify you and your haunts more than you would like.

I suspect that projects like Open WLAN Map are basically doing a signal strength weighted average of all the points they receive for any given AP. With enough data that should give a pretty accurate location, and it is conceptually easy to do. Open WLAN Map definitely retains information tying the contributors to the data contributed. So they, or some hacker who breaks into their system, might find the “bread crumb” trail you have contributed and learn more about you than you might like.
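
To make the idea of a signal strength weighted average concrete, here is a minimal Python sketch. It is only an illustration of the general technique, not a description of what Open WLAN Map actually runs. The Sample class, the weighted_center function and the choice of converting dBm to linear milliwatts for the weights are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    lat: float       # degrees
    lon: float       # degrees
    rssi_dbm: float  # received signal strength, e.g. -40 (strong) to -95 (weak)


def weighted_center(samples):
    """Estimate an AP position as the signal-strength-weighted mean of its samples."""
    if not samples:
        raise ValueError("need at least one sample")
    # Assumed weighting: convert dBm to linear milliwatts so strong readings
    # dominate. A real project may well weight its samples differently.
    weights = [10 ** (s.rssi_dbm / 10.0) for s in samples]
    total = sum(weights)
    # Averaging latitude and longitude directly is only reasonable over the
    # small area one AP covers (and away from the poles and the antimeridian).
    lat = sum(w * s.lat for w, s in zip(weights, samples)) / total
    lon = sum(w * s.lon for w, s in zip(weights, samples)) / total
    return lat, lon
```

Note that every sample fed into such an average is a recorded position of the contributor, which is exactly the trail discussed above.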

Trilateral Points

So how could your phone collect useful information without over-reporting your activities? My thought is that you could look for where the signal fades out rather than where it is strongest. That is, find the edge of the roughly circular coverage area rather than the center. The goal would be to find three points on the circumference that describe the coverage area. Those three points also describe a triangle whose perimeter is easy to compute.

Your best three points will be evenly spaced around the circumference, which is also where the triangle perimeter is maximized.
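
For readers who want to check the claim about the perimeter, here is a short derivation under the idealized assumption that the coverage area is a perfect circle of radius r. The symbols r, P and the central angles α, β, γ between neighboring points are introduced only for this illustration. Each side of the triangle is a chord of length 2r sin(θ/2), and because the sine function is concave on [0, π], the sum is largest when the three angles are equal.

```latex
\begin{align*}
P &= 2r\left(\sin\frac{\alpha}{2} + \sin\frac{\beta}{2} + \sin\frac{\gamma}{2}\right),
  \qquad \alpha + \beta + \gamma = 2\pi \\
P &\le 2r \cdot 3\,\sin\frac{\alpha + \beta + \gamma}{6}
   = 6r\,\sin\frac{\pi}{3}
   = 3\sqrt{3}\,r
\end{align*}
```

Equality holds only when α = β = γ = 2π/3, that is, when the three points are evenly spaced.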

When a new sample is received for an AP, a check is made to see if replacing any one of the old three points will increase the perimeter. If so, the new sample is saved and the one it replaced is dropped. After a short while your phone will have a pretty stable set of three points and will no longer record new data for that AP. Your house and your neighbor’s house will have the same number of samples, so your house won’t stand out as much.
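
The replacement rule can be sketched in a few lines of Python. This is a simplified illustration, not the code of the plug-in: the function names, the use of (lat, lon) tuples and the equirectangular distance approximation are assumptions made for the example, and signal strength is ignored entirely.

```python
import math

EARTH_RADIUS_M = 6371000.0


def distance_m(a, b):
    """Approximate ground distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    # Equirectangular approximation: adequate over the short range of a WiFi AP.
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2.0)
    y = lat2 - lat1
    return EARTH_RADIUS_M * math.hypot(x, y)


def perimeter_m(points):
    """Perimeter of the triangle formed by exactly three points."""
    a, b, c = points
    return distance_m(a, b) + distance_m(b, c) + distance_m(c, a)


def update_points(points, new_point):
    """Return (points, changed), keeping at most three points for one AP."""
    if new_point in points:
        return points, False
    if len(points) < 3:
        return points + [new_point], True
    best, best_perimeter = points, perimeter_m(points)
    # Try the new sample in place of each stored point; keep the largest triangle.
    for i in range(3):
        candidate = points[:i] + [new_point] + points[i + 1:]
        p = perimeter_m(candidate)
        if p > best_perimeter:
            best, best_perimeter = candidate, p
    return best, best is not points
```

A sample taken inside the current triangle can never increase the perimeter, so once the triangle has grown to the edge of the coverage area, repeated visits to the same spots leave the stored data unchanged.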

Instead of collecting multiple samples per minute for every AP seen, which leads to thousands of samples in a very short time, you have a smaller data set that is limited to three times the number of APs your phone has detected. So keeping your collection history on your phone now becomes possible.

By retaining a memory of the best samples that describe the location of each AP, the phone can limit the data it sends to samples for newly discovered APs and for APs where you have acquired better data. It should not send data repeatedly for areas that you frequent. The number of APs that one phone collects is quite small compared to the storage available on most phones today, so it should be possible to store AP data for a fairly long time.
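
One way to picture this bookkeeping is a small on-phone store that remembers the retained points per AP and queues an AP for upload only when those points change. The sketch below is hypothetical: the ApStore class, its method names and the use of a plain identifier string (such as a BSSID) as the key are assumptions for illustration, and no real upload API is involved.

```python
class ApStore:
    """On-phone store that remembers what has already been queued for upload."""

    def __init__(self):
        self.best = {}        # AP identifier -> tuple of retained points
        self.pending = set()  # identifiers whose points changed since the last upload

    def save(self, ap_id, points):
        """Record the retained points for an AP; queue it only if they changed."""
        points = tuple(points)
        if self.best.get(ap_id) != points:
            self.best[ap_id] = points
            self.pending.add(ap_id)

    def take_upload_batch(self):
        """Return data for new or improved APs and clear the queue, keeping the history."""
        batch = {ap_id: self.best[ap_id] for ap_id in sorted(self.pending)}
        self.pending.clear()
        return batch
```

Because the history stays in best after an upload, saving the same points again on the next visit adds nothing to the queue.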

There are, of course, little details like detecting moved (and moving) APs but that can be handled too.

As a proof of concept I’ve written a plug-in for the microG project’s unified network location provider (NLP) that builds such a database on the phone whenever the GPS is in use and then provides that data to the NLP when the GPS is off or has not yet acquired a first fix. So far it seems to do exactly what I want, and the locations it provides when starting my maps and navigation apps are remarkably accurate. I will track the database size over time and see if it stays a reasonable size. If so, I will either alter it so that it can submit data to a project like Open WLAN Map or submit changes to that project’s existing collection apps.

Maybe No Centralized AP Location Repositories Are Needed

Usually you go to places you have been to before so your phone will have learned nearly all of the AP locations of importance to you pretty quickly. And if you are going to a new place you are likely to use a navigation app which will have the GPS on. So you will be collecting AP data along the way. And the data at your destination will be high quality because you will be moving slowly as you park. Later, when ready to leave, that new AP data is available to give you a fast GPS time to first fix for your navigation app.

It seems that building a personal AP database on the fly for the areas you visit, with a fallback to a reasonable cell tower database, may cover the majority of use cases. And you know exactly where your data is and how it is being used.