This “Your iPhone Is Tracking Your Every Move!!” craziness just won’t go away. I’ve been kind of disappointed by the lack of very detailed analysis of the data that’s actually being collected, so I spent some time collecting information of my own.
I have access to four iOS devices running 4.0 or better: my personal iPhone 3GS, a family iPad with a 3G subscription, a company-owned iPad (whose 3G has never been activated), and a newly arrived iPad 2 that belongs to a client. So I spent some time this weekend trying to better understand what the Core Location daemons are doing.
First, please forgive me if I’m retreading already explored ground. Turns out that a few other people did the same thing this weekend, and so maybe I’m late to the party. I don’t want to be a “Me, too!” poster, but I also think there’s a little that I’ve found that I haven’t seen mentioned yet. Plus, I should mention the work of Alex Levinson, who looked at this in detail a year ago and has been a solid voice of reason from the beginning.
Anyway, first I’ll talk about some of what I observed, then I’ll see if I can’t draw a few (hopefully valid) inferences. Some of the data were taken from the devices just as they were last week. Saturday, though, we went out to lunch and I took my phone, the company iPad, and the family iPad all with me. During that trip, I kept the family iPad locked the entire time, and I used the company iPad on the road (with Google Maps open the whole way). I used my phone briefly to make a call, checked Twitter a couple of times while at the restaurant, and used it for a while in a parking lot as my wife went into the grocery store.
First, the database.
I can see 5 tables within the consolidated.db that seem to be pertinent: CellLocation, CellLocationLocal, CellLocationHarvest, WifiLocation, and WifiLocationHarvest. All of these include details about speed, accuracy, elevation, and other such items that I’m not really concerned with (and many of which don’t seem to be used, at any rate). All also include a timestamp, latitude, and longitude, as well as some way of uniquely identifying the point it represents. In the case of a Wi-Fi access point, this is the MAC address, and in the case of a cell tower, it’s a tuple of four data items. Each entry in these tables appears to be unique: no single cell tower or Wi-Fi access point appears more than once. Point 1: The devices are not tracking my every movement.
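For anyone who wants to repeat the uniqueness check, here is a minimal sketch using Python's built-in sqlite3 module. The column names in the commented-out usage (`MAC` for Wi-Fi, and `MCC`, `MNC`, `LAC`, `CI` for the four-part cell tower key) are assumptions on my part, so inspect the real schema before relying on them.

```python
import sqlite3

def duplicate_keys(conn, table, key_columns):
    """Return (key..., count) rows for keys that occur more than once in `table`."""
    cols = ", ".join(key_columns)
    # Table and column names cannot be bound as SQL parameters, so they are
    # interpolated here; only pass names you trust.
    query = (
        f"SELECT {cols}, COUNT(*) FROM {table} "
        f"GROUP BY {cols} HAVING COUNT(*) > 1"
    )
    return conn.execute(query).fetchall()

# Hypothetical usage. The column names are assumptions, not confirmed above.
# conn = sqlite3.connect("consolidated.db")
# print(duplicate_keys(conn, "WifiLocation", ["MAC"]))
# print(duplicate_keys(conn, "CellLocation", ["MCC", "MNC", "LAC", "CI"]))
```

An empty result for each table would match the observation that no access point or tower appears twice.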
Now, my phone.
I see several access points noted all around my house. The accuracy isn’t phenomenal, as it puts my access point on my deck, and a neighbor’s in the middle of my kitchen. In fact, there are 11 different access points displayed either in my house, my yard, or just into my neighbors’ yards. Point 2: The Wi-Fi data points are not precisely located.
Also, the timestamps are varied. Four of the 11 around my house show a date/time from a couple days before I dumped the database (and another 4 are stamped two seconds later). But the other three are from early March, late February, and mid January. Point 3: The Wi-Fi data does not represent the last time I visited a location.
Finally, huge swaths are blanketed with data about Wi-Fi access points, including neighborhoods I’ve not driven through in months, if not years (or ever). These points have timestamps similar to those of the data within my neighborhood. Point 4: Data is present in the database for locations I’ve not visited.
The cell tower data is very similar. It shows towers located in areas I’ve not recently visited, with locations not corresponding to actual towers (in many cases, not even close — several were shown in residential communities where I’ve never seen a tower). The timestamps are similarly varied, with some I randomly clicked on going back to October 2010. Point 5: Cell tower data is treated the same as Wi-Fi access point data.
I did not see any new data points appear during the drive to the restaurant, or while we ate. However, a batch of data, both Cell and Wi-Fi, was timestamped while we sat outside the grocery store. The cell data, in particular, was scattered over a very wide area, at least several miles on a side. Point 6: Data appears for a wide area simultaneously, and is not necessarily tied to length of time sitting still.
Finally, I observed new data in the WifiLocationHarvest table. A total of 11 Wi-Fi access points were simultaneously recorded while I waited in the parking lot. The precision on this was pretty good — only about 50 feet from where I was sitting. Points 7 and 8: Actual recording of new data is not predictable, and is highly accurate.
I was also able to look at some past data on the phone. I took a one-day trip to Dallas at the end of March, and found large collections of data centered on the location I’d visited, the area where I ate lunch, and three locations on the highway leading from the airport. Those locations roughly correspond, I believe, with times when I’d refreshed Google Maps directions. Point 9: You may be able to force a data fetch by refreshing the maps application.
My family iPad, which I’d woken up before we left and promptly locked again, did not record any new data the entire time. Point 10: When locked, the device might not record anything at all.
The company iPad was in use the whole way to the restaurant. It has no record of any cell towers, which isn’t terribly surprising, since it does not have an active 3G data plan (though it does have the 3G hardware). Point 11: No data plan, no cell info.
Obviously, since there was no data plan, it couldn’t collect any new data along the way. However, as we left the grocery store, I unlocked the device, refreshed the map location, and locked it again. Once we’d returned home, the iPad fetched 394 Wi-Fi points, in an area about 1/2 mile by 1/2 mile, roughly corresponding to the place we were when I refreshed the map. All these data points were timestamped when they were fetched (that is, when the iPad had access to the Wi-Fi at home), not when I was actually on the road. Point 12: The device may cache your last request and fetch related data the next time a network is available.
All three iPads showed a curious distribution of points around my office. The customer’s iPad, which has only been to the customer facility and my office, displayed points in a very short and wide rectangle centered on my office. My family iPad, which has only been a few places since I loaded 4.0 on it, showed virtually the same distribution around the office and a similar distribution, but not as wide, around my house. Not all of these points had the same timestamp, but over time, it definitely started filling out that shape. Point 13: When fetching data, the device appears to collect points over a nearly-fixed vertical range (about 30 arcseconds of Latitude) and a variable horizontal range.
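To put that figure in perspective, 30 arcseconds of latitude is a bit over 900 meters north to south (one arcminute of latitude is roughly one nautical mile). Here is a small sketch of the arithmetic, plus a helper for measuring the extent of any set of points. It assumes a spherical Earth, and the function names are mine, not anything from Core Location.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; spherical approximation

def latitude_arcseconds_to_meters(arcseconds):
    """North-south distance covered by the given number of arcseconds of latitude."""
    degrees = arcseconds / 3600.0
    return math.radians(degrees) * EARTH_RADIUS_M

def extent_in_arcseconds(points):
    """points: list of (lat, lon) in degrees. Returns (height, width) in arcseconds."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return ((max(lats) - min(lats)) * 3600.0, (max(lons) - min(lons)) * 3600.0)

print(round(latitude_arcseconds_to_meters(30)))  # a little over 900 meters
```

Running `extent_in_arcseconds` over each batch of fetched points would be one way to test whether the height really stays near 30 arcseconds while the width varies.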
Finally, my wife had taken the family iPad on a short trip last weekend. The iPad showed a square burst of Wi-Fi data points about where she pulled over to check a map, and another wide rectangle around the hotel she stayed in. It also showed data in the CellLocationLocal table. That table showed her track along the interstate, and appeared to be an actual positional track. Interestingly, the CellLocation table did not have tower locations for virtually anywhere along that track. On my phone, I had two points from my Dallas trip, and a half-dozen points from a taxi ride into Manhattan a week prior. Point 14: The CellLocationLocal table may record actual trip data, but it appears to be very limited.
One further point of (potential) interest: The timestamps on the data were, if you’ll pardon the pun, all over the map. Many data sets had timestamps only a few seconds or minutes apart. But when I stripped out data sets that were within five minutes of another set of points, the average time between updates was about 14 hours. Note that there’s very little statistical rigor to this, but I thought it was interesting. Point 15: When the device spends an extended time at one place, data appears to be fetched about twice a day.
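The five-minute filter can be read a few ways. Here is one interpretation as a sketch: sort the timestamps, start a new "burst" whenever the gap from the previous timestamp exceeds five minutes, and average the time between burst starts. The function names and the exact grouping rule are assumptions for illustration.

```python
def burst_starts(timestamps, window=300):
    """Collapse timestamps (in seconds) into bursts. A new burst begins
    whenever the gap from the previous timestamp exceeds `window` seconds."""
    starts = []
    previous = None
    for t in sorted(timestamps):
        if previous is None or t - previous > window:
            starts.append(t)
        previous = t
    return starts

def mean_gap_hours(timestamps, window=300):
    """Average time between burst starts, in hours (None if fewer than two bursts)."""
    starts = burst_starts(timestamps, window)
    if len(starts) < 2:
        return None
    gaps = [b - a for a, b in zip(starts, starts[1:])]
    return sum(gaps) / len(gaps) / 3600.0
```

A mean is sensitive to a few long gaps (a device left switched off for a day, say), so a median might be worth checking alongside it.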
Summary of Observations
So, to sum up, here are my observations thus far:
- Point 1: The devices are not tracking my every movement.
- Point 2: The Wi-Fi data points are not precisely located.
- Point 3: The Wi-Fi data does not represent the last time I visited a location.
- Point 4: Data is present in the database for locations I’ve not visited.
- Point 5: Cell tower data is treated the same as Wi-Fi access point data.
- Point 6: Data appears for a wide area simultaneously, and is not necessarily tied to length of time sitting still.
- Points 7 and 8: Actual recording of new data is not predictable, and is highly accurate.
- Point 9: You may be able to force a data fetch by refreshing the maps application.
- Point 10: When locked, the device might not record anything at all.
- Point 11: No data plan, no cell info.
- Point 12: The device may cache your last request and fetch related data the next time a network is available.
- Point 13: When fetching data, the device appears to collect points over a nearly-fixed vertical range (about 30 arcseconds of Latitude) and a variable horizontal range.
- Point 14: The CellLocationLocal table may record actual trip data, but it appears to be very limited.
- Point 15: When the device spends an extended time at one place, data appears to be fetched about twice a day.
What does all this tell us? I think we can infer at least a few things, which are consistent with what others have been saying, and with Apple’s statements last year.
- The data in WifiLocation and CellLocation are not your device’s actual location at any given point in time, but instead are the location of others’ Wi-Fi access points and cell towers.
- The locations of these points are estimated by Apple based on data harvested by iOS devices and provided to Apple on a periodic basis.
- Individual devices periodically record the Wi-Fi points and cell towers visible to them, record a precise location, and send that data to Apple. (I have not yet observed this happen, but it makes sense, and Apple’s already said as much).
- Periodically, the device will poll Apple’s servers for location information nearby. This seems to happen when the device has been at rest for some time, or when the location information is refreshed in the map application (it may be reasonable to expect that other applications querying the Core Location service may also trigger a refresh). There may be some logic in terms of what data gets fetched, perhaps to avoid downloading duplicate information. I haven’t been able to dig into that yet.
- The timestamp for the fetched data appears to be the time the data was fetched. One may be able to look in the middle of a set of identically-stamped data to infer where the user was when that data was fetched. However, the data don’t appear to be fetched every time you’re in any given location, even if you’re there for an extended time (like, say, lunch).
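The "look in the middle of a set of identically-stamped data" idea from the last bullet can be sketched as a group-by on the timestamp followed by a simple average. The input shape (tuples of timestamp, latitude, longitude) is an assumption. Averaging raw coordinates is only reasonable because each batch covers a small area.

```python
from collections import defaultdict

def batch_centers(rows):
    """rows: iterable of (timestamp, lat, lon).
    Returns {timestamp: (mean_lat, mean_lon, point_count)}."""
    batches = defaultdict(list)
    for ts, lat, lon in rows:
        batches[ts].append((lat, lon))
    centers = {}
    for ts, pts in batches.items():
        n = len(pts)
        mean_lat = sum(p[0] for p in pts) / n
        mean_lon = sum(p[1] for p in pts) / n
        centers[ts] = (mean_lat, mean_lon, n)
    return centers
```

Given the rectangular fetch areas described earlier, the center of a batch is at best a rough guess at where the device was, not a fix.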
So what’s my conclusion? I’m still not sure about the CellLocationLocal table, which perhaps might be for recording locations for future data fetches. But the rest of the data all seem very consistent with what Apple’s told us: they’re used to aid in geolocating the device. Why are so many points stored? So that it won’t have to pull data down again in the future. It’s a big, personalized cache, made to make my personal use of geolocated features faster and more accurate.
[Note -- if you're interested in the python script I used to load the data into Google Earth, I'm posting it on the Intrepidus Group blog. It should be attached to this post from last week about my first review of the data.]
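This is not the script mentioned in the note, just a minimal sketch of the general idea for readers who want to roll their own: turn a list of named points into a KML document that Google Earth can open. The input shape and function name are hypothetical.

```python
from xml.sax.saxutils import escape

def points_to_kml(points):
    """points: iterable of (name, lat, lon). Returns a KML document as a string."""
    placemarks = []
    for name, lat, lon in points:
        # KML coordinates are written longitude first, then latitude.
        placemarks.append(
            "  <Placemark>\n"
            f"    <name>{escape(str(name))}</name>\n"
            f"    <Point><coordinates>{lon},{lat}</coordinates></Point>\n"
            "  </Placemark>\n"
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<kml xmlns="http://www.opengis.net/kml/2.2">\n'
        "<Document>\n"
        + "".join(placemarks)
        + "</Document>\n"
        "</kml>\n"
    )
```

Writing the returned string to a file with a .kml extension should be enough for Google Earth to load it.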
In 2009, the Verizon Business Risk Team released their first public Data Breach Investigations Report. I saw it reasonably soon after release, and noticed a whole bunch of binary numbers in the background on the cover. “Cool,” I thought, but I didn’t bother trying to decode it. A week or so later, I learned that there’d been a contest, and I missed out. :(
In 2010, I was ready, and tried to solve the puzzle, but failed. That story comes later.
But now, on the eve of the release of the 2011 DBIR, I’m finally documenting the method needed to solve these puzzles. Here’s a quick, fresh look at the 2009 puzzle.
As always, if you’d like to try to solve this yourself, then STOP now, as the rest of this post is full of spoilers. If you’d like a copy of just the raw data (in this case, two ciphertexts), click here.
So, I vaguely remembered how this worked. And also that it was a very simple puzzle. Let’s see how quickly I can solve it, without digging too deep into my memory for what needed to get done. First, I pulled down the original PDF. And there, all over the background of the cover, is a whole bunch of binary numbers. Highlight, copy, and paste out into a file.
First, how do we break up the numbers? 8 bits? 7 bits? I removed the line breaks, counted, and divided by 8, but didn’t get a whole number of bytes. Found some text in the middle, removed that. Now count again: ah, better. Looks like it’s 900 8-bit characters.
Next up — a simple script to decode the binary. Doesn’t take more than a few minutes, and now I’ve got a big block of ciphertext.
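A decoder along these lines is only a few lines of Python. This is a sketch, not the script I used at the time: it keeps only the 0 and 1 characters, checks that they divide evenly into groups, and converts each group to a character.

```python
def decode_binary(text, bits=8):
    """Decode a string of 0/1 digits (whitespace and other characters ignored)
    into text, reading `bits` digits per character."""
    digits = "".join(ch for ch in text if ch in "01")
    if len(digits) % bits != 0:
        raise ValueError(
            f"{len(digits)} binary digits is not a multiple of {bits}"
        )
    return "".join(
        chr(int(digits[i:i + bits], 2)) for i in range(0, len(digits), bits)
    )
```

Note that stray text containing the digits 0 or 1 would still throw the count off, so it needs to be removed by hand first.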
Okay, what kind of encryption did they use? A quick test of ROT-13 and such doesn’t get me anywhere. It’s awfully long, so I really don’t want to try a substitution cipher if I can avoid it. Then I remember that there was a clue somewhere in the report. Skimming through, I found a footnote on page 48:
yr puvsser vaqrpuvssenoyr
Let’s run that through ROT-13, and sure enough, we get a hint:
le chiffre indechiffrable
Aha! That’s French. And one of the most commonly found ciphers, it seems, for hacker crypto challenges is named for a Frenchman. I also know, because of how often I’ve run up against this cipher, that it came to be known as “le chiffre indechiffrable” (or the indecipherable cipher). So which cipher it is has been decided: it’s a Vigenère.
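For the record, the ROT-13 step on the footnote needs nothing more than Python's standard library, which ships a `rot_13` text transform in the codecs module.

```python
import codecs

def rot13(text):
    """Rotate each ASCII letter 13 places; other characters pass through unchanged."""
    return codecs.decode(text, "rot_13")

print(rot13("yr puvsser vaqrpuvssenoyr"))  # le chiffre indechiffrable
```

Because 13 is half of 26, the same function both encodes and decodes.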
But how long is the key? I found an online applet that would do a Kasiski analysis, which looks for repeated trigraphs in the cipher and measures the distance between them. If you can find a common factor amongst a bunch of repeated trigraph distances, that could very well be the key length. I found 10 repeated trigraphs, but their distances are all over the map, and I can’t see anything that’s a clear common factor.
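I used an online applet, but the Kasiski step is easy to sketch. This version records where each trigraph occurs, reports the distances between consecutive repeats, and counts how many of those distances each candidate key length divides. The function names are mine.

```python
from collections import defaultdict

def repeated_trigraph_distances(ciphertext):
    """Map each trigraph that occurs more than once to the distances
    between its consecutive occurrences."""
    positions = defaultdict(list)
    for i in range(len(ciphertext) - 2):
        positions[ciphertext[i:i + 3]].append(i)
    return {
        tri: [b - a for a, b in zip(pos, pos[1:])]
        for tri, pos in positions.items()
        if len(pos) > 1
    }

def divisor_counts(distances, max_length=30):
    """For each candidate key length, count how many distances it divides."""
    return {
        n: sum(1 for d in distances if d % n == 0)
        for n in range(2, max_length + 1)
    }
```

Some repeats are coincidental, which is why the distances can look like noise. A candidate length that divides most of them is a hint, not proof.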
Next up, the index of coincidence, which is a way of looking at the ciphertext in varying keylengths to see which one seems to have “slices” that are the most internally consistent. That’s a simplification. Truth is, I don’t understand it much beyond a zen-like vagueness, so I’m not going to try to explain it here.
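For readers who want something slightly less zen: the index of coincidence is the probability that two letters picked at random from a text are the same. English text scores around 0.066, while uniformly random letters score about 1/26, or 0.038. If you guess the right key length, every slice of the ciphertext (every Nth letter) was shifted by a single key letter, so each slice should score closer to English. A rough sketch, with function names of my own choosing:

```python
from collections import Counter

def index_of_coincidence(text):
    """Probability that two letters drawn at random from `text` match."""
    n = len(text)
    if n < 2:
        return 0.0
    counts = Counter(text)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

def average_ic(ciphertext, key_length):
    """Average index of coincidence over the slices for a candidate key length."""
    slices = [ciphertext[i::key_length] for i in range(key_length)]
    return sum(index_of_coincidence(s) for s in slices) / key_length
```

Trying `average_ic` over a range of lengths and looking for peaks is the usual approach. Multiples of the true key length tend to peak as well, and short slices give noisy scores.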
Anyway, the IC applet makes 9 characters look like a good potential key length, though it’s far from certain. But at least one of the Kasiski distances was 72, which is a multiple of 9, and so this is as good a place to start as any.
Next up, I stick the ciphertext into a nice interactive Vigenere applet, set it to a 9-character key, and start sliding the alphabets around to see if anything pops out. Not having anywhere better to start, I make a guess that the plaintext starts with “CONGRATULATIONS.” As I adjust the various alphabets to make this happen, the key starts to appear. C-H-A-N-G-I-N-E. Hm. So close. Let’s change it to CHANGING…and now it’s looking a lot more real. Here’s the beginning of the plaintext:
Except that this is the only plaintext I see. If it were a 9-character key, then a key of “CHANGINGA” would at least give me 8 characters of real text, repeated down the length of the output, with a junk character in between each. At this point, I could think back, remember that the key was actually present in the text, find the two instances of “changing” in the report and have the puzzle solved in less than 30 minutes total. But that’d be cheating. So let’s try something new.
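Sliding alphabets around in an applet to fit a guessed opening word is the same as subtracting the crib from the ciphertext, letter by letter, modulo 26. A minimal sketch, assuming the text has already been reduced to uppercase A-Z:

```python
def key_from_crib(ciphertext, crib):
    """Return the key letters implied by assuming the plaintext starts with `crib`."""
    key = []
    for c, p in zip(ciphertext.upper(), crib.upper()):
        key.append(chr((ord(c) - ord(p)) % 26 + ord("A")))
    return "".join(key)

# Made-up ciphertext built for this example, not taken from the puzzle:
print(key_from_crib("EVNTXIGY", "CONGRATS"))  # CHANGING
```

If the crib is right, the recovered letters tend to look like a word, which is how a near miss in the key gets spotted and corrected.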
It’s looking pretty likely that the key starts with “CHANGING.” But I don’t know how many characters come next. I didn’t see a repeat at 8 or 9 characters, so let’s add another A, and another, and another, until I see things repeat. Once I get to 26 characters it happens. Now I’ve got plaintext that starts like this:
So now, let’s start changing the letters after CHANGING and see what happens. A is no good, neither is B, nor C, but D — that seems to extend the cleartext words properly. In fact, the ZZZ after GOTO are probably supposed to be WWW. To make that happen, my key now starts with “CHANGING DEF”, which gives me this:
From here, it’s a pretty easy job to finish out the key this way. The result is “Changing default credentials.” (it also appears in the report as “Changing default credentials is key.” Is. Key. Heh. Funny.) The final plaintext tells where to write with your solution, and the rest is a terse, high-level summary of the entire report. Here it is with spaces and newlines entered for clarity.
FIRST TO CRACK GETS REWARD
GO TO WWW VERIZONBUSINESS COM SLASH DBIRHUNT TO CLAIM
FOR EVERYONE ELSE HIGH LVL STATS FOR FIN SVCS AND RETAIL FOLLOW
SOURCES EXTERNAL NINETEEN INTERNAL NINE PARTNER TWO
THREATS MALWARE ELEVEN HACKING FIFTEEN DECEIT FOUR MISUSE SIX PHYSICAL TWO ERROR ONE
ERROR SIG CONTRIBUTOR IN FIFTEEN
TOP THREE HACK TYPES SQL INJECTION SEVEN MISCONFIG ACLS SEVEN DEFAULT CREDS TWO
TOP HACK VECTOR IS WEB APP TEN
TOP ASSET IS ONLINE DATA TWENTY SIX AND
ALL RECORDS TOP THREE DATA TYPES AUTH CRED ELEVEN PII TEN PYMNT CARD EIGHT
PYMNT CARD WAS NINETY EIGHT PCT OF RECORDS
TOP UU IS UNKNOWN CONNECTIONS SEVEN
DISCOVERY TAKES WEEKS TO MONTHS
EXTERNAL TWENTY THREE INTERNAL ONE PARTNER EIGHT
THREATS MALWARE TEN HACKING TWENTY ONE DECEIT TWO MISUSE TWO PHYSICAL ZERO ERROR ZERO
ERROR SIG CONTRIBUTOR IN SIXTEEN
TOP TWO HACK TYPES SQL INJECTION SEVEN STOLEN CREDS SEVEN
TOP HACK VECTOR IS REM ACCMGT EIGHT
TOP ASSET IS POS ELEVEN AND
OVER HALF OF RECORDS TOP TWO DATA TYPES PAYCARD TWENTY THREE PII NINE
DISCOVERY TAKES MOSTLY MONTHS
My memory was correct in one respect — this was a very simple puzzle. Even the long approach I took, once I’d figured it out, went fast. If I’d received this puzzle new, today, I’m sure I would have solved it in an evening, tops. Two years ago, I almost certainly wouldn’t have been so lucky. For one, the trick of padding out the potential key to look for repeats isn’t something that’d ever occurred to me before, that I can recall, though it’s pretty obvious in retrospect. I’ll definitely have to remember this technique for future puzzles.
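The padding trick works because a key letter of A is a shift of zero: padded positions pass the ciphertext through untouched, while the known key letters decrypt their positions properly. When the padded length matches the real key length, readable text lines up at the start of every period. A sketch, assuming uppercase A-Z text with no spaces:

```python
def vigenere(text, key, decrypt=True):
    """Apply a Vigenere shift to uppercase A-Z text. Decrypting subtracts the key."""
    sign = -1 if decrypt else 1
    out = []
    for i, ch in enumerate(text):
        shift = ord(key[i % len(key)]) - ord("A")
        out.append(chr((ord(ch) - ord("A") + sign * shift) % 26 + ord("A")))
    return "".join(out)

def decrypt_with_padded_key(ciphertext, known_prefix, total_length):
    """Decrypt using the known start of the key, padded with 'A' to total_length."""
    padding = "A" * (total_length - len(known_prefix))
    return vigenere(ciphertext, known_prefix + padding)

def show_periods(text, period, width=8):
    """Print the first `width` characters of each period-sized row."""
    for i in range(0, len(text), period):
        print(text[i:i + width])
```

Calling `show_periods(decrypt_with_padded_key(ct, "CHANGING", n), n)` for increasing `n` and watching for the rows to become readable is the loop I was doing by hand.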
Also, having “CONGRATS” as the opening word gave me a really easy crib. Without that, I honestly don’t know where I’d have started.
So though I was right that this was a simple puzzle, I was wrong in another key respect: that its simplicity would mean it wasn’t going to be any fun, especially (subconsciously at least) knowing what I needed to do. Learning a new approach to break this cipher was fantastic fun. And proof that even the easy puzzles shouldn’t be ignored.
Thanks to the whole Verizon crew for this one. The 2010 puzzle was a different story, but that’ll wait until later. Hopefully I’ll write that up before next week’s new puzzle starts sucking up all my free time…