Analyzing my Google Location History

语言: CN / TW / HK

I recently readthis article on how to create a heat-map from your Google Location History data. Trying it out myself, I got some amazing results:

The large red circles represent the cities I have spent a significant amount of time. The purple shade in different places represents places I had travelled to or passed through while on a train journey.

My previous phone had some GPS issues, which was leading to my location being shown in Arizona, USA!. Surprisingly (or not?) it even gave a proof of that!

All this was really cool to look at, but I really wanted to dive in and get some more insights into my travelling patterns throughout the years.

Data Pre-Processing

Like most data science problems, data pre-processing was definitely the pain point. The data was in a JSON format where the meaning of different attributes wasn’t very clear.

Data Extraction

{'timestampMs': '1541235389345',
  'latitudeE7': 286648226,
  'longitudeE7': 773296344,
  'accuracy': 22,
  'activity': [{'timestampMs': '1541235388609',
    'activity': [{'type': 'ON_FOOT', 'confidence': 52},
     {'type': 'WALKING', 'confidence': 52},
     {'type': 'UNKNOWN', 'confidence': 21},
     {'type': 'STILL', 'confidence': 7},
     {'type': 'RUNNING', 'confidence': 6},
     {'type': 'IN_VEHICLE', 'confidence': 5},
     {'type': 'ON_BICYCLE', 'confidence': 5},
     {'type': 'IN_ROAD_VEHICLE', 'confidence': 5},
     {'type': 'IN_RAIL_VEHICLE', 'confidence': 5},
     {'type': 'IN_TWO_WHEELER_VEHICLE', 'confidence': 3},
     {'type': 'IN_FOUR_WHEELER_VEHICLE', 'confidence': 3}]}]},
 {'timestampMs': '1541235268590',
  'latitudeE7': 286648329,
  'longitudeE7': 773296322,
  'accuracy': 23,
  'activity': [{'timestampMs': '1541235298515',
    'activity': [{'type': 'TILTING', 'confidence': 100}]}]

After researching a bit, I stumbled upon this article , which cleared a lot of things. However, there are some questions that still remain unanswered:-

  • What does the activity type tilting mean?
  • I assumed confidence to be the probability of each task. However, often they do not add up to 100. If they do not represent probabilities what do they represent?
  • What is the difference between the activity type walking and on foot  ?
  • How can Google possibly predict whether the activity type between IN_TWO_WHEELER_VEHICLE vs IN_FOUR_WHEELER_VEHICLE  ?!

If anyone has been able to figure it out, please let me know in comments.

Assumptions

As I continued to structure my pre-processing pipeline, I realized I will have to take some assumptions to take into account all the attributes of the data.

  • The GPS is always on (A strong assumption which is later taken care of).
  • The confidence interval is the probability of the activity type. This assumption helps us take into account various possible activity types for a given instance without under-representing or over-representing any particular activity type.
  • Each log has two types of timestamps. (i) Corresponding to position latitude and longitude. (ii) Corresponding to the activity. Since the difference between two timestamps was usually very minute ( < 30 seconds) I safely used the timestamp corresponding to latitude and longitude for our analysis

Data Cleaning

Remember I told how my GPS was giving Arizona, USA as the location? I did not want those data points to significantly differ the results. Using the longitudinal boundaries of India, I filtered out the data points pertaining only to India.

def remove_wrong_data(data):
    degrees_to_radians = np.pi/180.0
    data_new = list()
    for index in range(len(data)):
        longitude = data[index]['longitudeE7']/float(1e7)
        if longitude > 68 and longitude < 93:
            data_new.append(data[index])
    return data_new

Feature Engineering

Cities for each data-point

I wanted to get the corresponding city for each given latitude and longitude. A simple Google search got me the coordinates of the major cities I’ve lived in i.e. Delhi, Goa, Trivandrum and Bangalore.

def get_city(latitude):
    latitude = int(latitude)
    if latitude == 15:
        return 'Goa'
    elif latitude in [12,13]:
        return 'Bangalore'
    elif latitude == 8:
        return 'Trivandrum'
    elif latitude > 27.5 and latitude < 29:
        return 'Delhi'
    else:
        return 'Other'
data_low['city'] = data.latitude.apply(lambda x:get_city(x))

Distance

Logs consist of latitude and longitude. To calculate distance travelled between logs, one has to convert these values to formats that can be used for distance related calculations.

from geopy.distance import vincenty
coord_1 = (latitude_1,longitude_1)
corrd_2 = (longitude_2, longitude_2)
distance = vincenty(coord_1,coord_2)

Normalized Distance

Each log consists of activity. Each activity consists of one or more activity types along with the confidence (called as probability). To take into account the confidence of measurement, I devised a new metric called normalized distance which is simply distance * confidence

Data Analysis

Now comes the interesting part! Before I dive into the insights, let me just brief on some of the data attributes:-

accuracy:
day:
day_of_week:
month:
year:
distance:
city:

Outlier Detection

There are total 1158736 data points. 99% of the points cover a distance less than 1 mile. The rest 1% are anomalies generated due to poor reception/flight mode.

To avoid the 1% of data causing significant changes in our observations, we’ll split the data into two based on the normalized distance.

This also ensures that we remove the points which do not obey the assumption #1 we made during our analysis

data_low = data[data.normalized_distance < 1]
data_large = data[data.normalized_distance > 1]

Distance travelled with respect to the city

The data for 2018 correctly represents that majority of the time was spent at Bangalore and Trivandrum.

I wondered how distance travelled in Delhi (my hometown) came out to be more than the Goa, where I did my graduation. Then it hit me, I did not have a mobile internet connection for the majority of my college life :).

Travel Patterns in Bangalore and Trivandrum

In June 2018, I completed my internship in my previous organization (at Trivandrum) and joined Nineleaps (at Bangalore). I wanted to know how my habits changed on transitioning from one city to another. I was particularly interested in observing my patterns for two reasons:

  • Since I always had mobile internet while residing in these cities, I expected the representation to be an accurate representation of reality.
  • I’ve spent roughly the same amount of time in two cities, hence the data will not be biased towards any particular city.
Month vs Distance travelled while in Bangalore and Trivandrum
  • Multiple friends and family members visiting Bangalore in the month of October resulted in a huge spike in distance travelled in vehicles.
  • Initially, I was exploring Trivandrum. However, as my focus shifted to securing a full-time data science opportunity, the distance travelled drastically reduced from January to February to March.
  • A huge spike in distance travelled to Bangalore in October can be attributed to multiple family and friends visiting.
  • Vehicle usage is much higher in Bangalore between 20:00–00:00. I guess I am leaving office later in Bangalore.
  • I was walking a lot more in Trivandrum! The difference in walking distance from 10:00–20:00 shows how I was living a healthier lifestyle by taking a walk after every hour or two in office.

Conclusion

There is a lot more (like this and this ) to do with your Location history. You can also explore your Twitter/Facebook/Chrome data. Some handy tips when trying to explore your dataset:

  • Spend a significant amount of time pre-processing your data. It’s painful but worth it.
  • When working with large volumes of data, pre-processing can be computationally heavy. Instead of re-running the Jupyter Cells every time, dump the post-preprocessed data into a pickle file and simply import the data when you start again.
  • Initially, you might miserably fail (like me) in finding any patterns. Make a list of your observations and keep exploring the dataset from different angles. If you ever reach a point wondering whether any patterns are even present, ask yourself three questions: (i) Do I have a thorough understanding of the various attributes of data? (ii) Is there anything I can do to improve my pre-processing step? (iii) Have I explored the relationship between all the attributes using all possible visualization / statistical tools?

To get started you can use my Jupyter Notebook here .

If you have any questions/suggestions feel free to post on comments.

You can connect with me on LinkedIn or email me at k.mathur68@gmail.com.

分享到: