Exploring browser history using Python's sqlite3 module (plus: finding the database with proc or lsof)

Recently I've been trying different search engines like DuckDuckGo and Kagi. I started wondering how how often I'm falling back to Google. Am I habitually using Google search? Or perhaps not finding what I want via other search engines?

Rather than looking in the History tab, I decided to grab the data myself. Perhaps I'd visualise the data or do something interesting with it. (Spoiler: I didn't. But I did play around with proc and Python's sqlite3, the subjects of this post.)

Fair warning: this post isn't a "how to". I like reading about how someone worked something out and this post is in that style. Essentially it's "How I learned a little about how my browser stores data".

Finding the browser history

I don't usually pay attention to my browser history and had no idea where it's stored. Luckily, Mozilla explains where Firefox stores all sorts of data, from bookmarks and history to your personal dictionary and toolbar customisations.

On Ubuntu, the Firefox history is stored in ~/.mozilla/firefox/$your_profile_directory/places.sqlite where $your_profile_directory has a value specific to you, but probably includes the word "default".

Exploring the history database using sqlite3

Next challenge: how do I query a SQLite database using Python?

I was surprised to realise that I didn't know this. (If I'm working with a database in Python, usually I'm using Django.) I had vague memories of using a sqlite package, though not the last time I used it. Turns out it's the sqlite3 module in Python's standard library.

The official docs are mostly a description of the API but also contain useful advice. At the end is "Using sqlite3 efficiently". I didn't read that right away, so I spent a while manually fiddling with Cursor objects instead of letting sqlite3 do it for me. The section has other useful tips and is well worth reading.

Anyway, time for a bit of database exploration.


import sqlite3
from pathlib import Path

# In my case, the path was approximately
FIREFOX_PATH = Path.home() / ".mozilla/firefox/fi3dmgd.default-release/places.sqlite"
db = sqlite3.connect(FIREFOX_PATH)

# What tables do I have?
db.execute("SELECT name FROM sqlite_master WHERE type='table';").fetchall()

# Lots of tables... moz_historyvisits looks promising. What's in it?
db.execute("SELECT * FROM moz_historyvisits").fetchmany(3)

# Hmm. That only shows data (row content). What are the columns?
db.execute("SELECT name from PRAGMA_TABLE_INFO('moz_historyvisits');").fetchall()

# No URL, but there is a place_id. Presumably a foreign key to the places table. What's in that?
db.execute("SELECT name from PRAGMA_TABLE_INFO('moz_places');").fetchall()

# Found the URL. (Plus some surpising things like rev_host, the host name stored reversed.)
# So I need something like this:
db.execute(
    """
    SELECT visit_date, url FROM moz_historyvisits
    LEFT JOIN moz_places on moz_historyvisits.place_id = moz_places.id
    WHERE moz_places.url LIKE 'https://kagi.com%' OR moz_places.url LIKE 'https://duckduckgo.com%' OR moz_places.url LIKE 'https://www.google.co%/search?%'
    """
)

db.close()

It took a while to work out how to list tables or columns. A lot of information about SQLite assumes the reader is using SQLite's CLI, but convenience commands like .tables don't exist in the Python sqlite3 module. SQLite does, however, have PRAGMA statements, an extension to normal SQL. They mostly seem to let you query database metadata.

I also tweaked the Google match string a couple of times. Initially it was too broad, matching pages from Google Maps and other bits of the Google ecosystem. In a different way, the first version was too narrow because I wanted results for google.com and google.co.uk.

More importantly: this code isn't production-ready. It doesn't handle errors and because I was doing simple, one-off SELECTs, I didn't care about things like transactions or rollback behaviour.

Finding the browser history, proc edition

That was Firefox done, but I've also been trying the Vivaldi browser. Rather than look up where Vivaldi stores its data, I decided to work this one out myself.

Thinking through options, find didn't seem promising because I didn't know what the file would be called or whether it would definitely be another SQLite database.

Another option was strace, which seems incredibly useful whenever Julia Evans writes about it. Connecting to the Vivaldi process with strace proved pretty overwhelming and I still wasn't sure what I was looking for.

But I figured that Vivaldi probably accesses the history a lot. And it probably needs other config and data files, which are likely to be in a directory near wherever the history is stored. All of which meant I could probably find the history by looking at the files Vivalidi had open.

Enter /proc, which exposes a great deal of information about the OS, including information about running processes. I knew that /proc/some_pid/fd would show me the files (fd = file descriptors) the process has open. So first I needed to find the Vivaldi process ID (pid).


$ ps -ef | grep vivaldi
ellen   27852       1  1 08:37 ?   00:00:11 /opt/vivaldi/vivaldi-bin --enable-crashpad
ellen   27865   27852  0 08:37 ?   00:00:00 /opt/vivaldi/vivaldi-bin --type=zygote --no-zygote-sandbox --enable-crashpad --crashpad-handler-pid=27859 --enable-crash-reporter=,stable --change-stack-guard-on-fork=enable --enable-crashpad
ellen   27866   27852  0 08:37 ?   00:00:00 /opt/vivaldi/vivaldi-bin --type=zygote --enable-crashpad --crashpad-handler-pid=27859 --enable-crash-reporter=,stable --change-stack-guard-on-fork=enable --enable-crashpad
ellen   27868   27866  0 08:37 ?   00:00:00 /opt/vivaldi/vivaldi-bin --type=zygote --enable-crashpad --crashpad-handler-pid=27859 --enable-crash-reporter=,stable --change-stack-guard-on-fork=enable --enable-crashpad
# ... Lots more output skipped for brevity
        

There were a lot more Vivaldi processes than I expected. But now I knew the process was called vivaldi-bin and the parent (main) process had pid 27852. (The number next to 27852 is the pid of its own parent, in this case 1, aka systemd, which starts programs.)

(If you're wondering how I know this, it's from using commands like pstree -ps some_pid, which shows the full tree of processes for the specified pid, including systemd which is the root of the tree.)

If I'd been more observant at that point, I could have answered my question "Where's the browser data?" One of the Vivaldi processes runs with the option --database=~/.config/vivaldi/Crash Reports and it seems reasonable to guess the history database is stored somewhere near the Crash Reports. But I didn't notice that at the time.

Instead, I took the longer route and used /proc


$ ls -l /proc/27852/fd

# Most output skipped for brevity
total 0
lr-x------ 1 ellen ellen 64 May 25 14:26 0 -> /dev/null
l-wx------ 1 ellen ellen 64 May 25 14:26 1 -> 'pipe:[52909]'
lrwx------ 1 ellen ellen 64 May 25 14:26 10 -> 'socket:[61453]'
lrwx------ 1 ellen ellen 64 May 25 14:26 158 -> /home/ellen/.config/vivaldi/Default/History-journal
lrwx------ 1 ellen ellen 64 May 25 14:26 47 -> /home/ellen/.config/vivaldi/Default/History
      

Using the file command confirmed that "History" was an SQLite database, so I figured that was probably the file I wanted.

This post is already long, so I won't show code snippets exploring this database. The short version: there's a table called "visits", which has the visit time and a foreign key to a table called "urls". The "urls" table contains the URL itself and the page title.

Finding the browser history, lsof edition

Shortly after I'd found the file using /proc, I realised that I could have achieved the same thing using the command lsof (ls open files), which would have looked something like this.


$ ps -ef | grep vivaldi
ellen   27852   1  1 08:37 ?   00:00:11 /opt/vivaldi/vivaldi-bin --enable-crashpad

$ lsof -p 27852 | grep REG
vivaldi-b 1306111 ellen  mem-w    REG    8,2   1802240  9965138 /home/ellen/.config/vivaldi/Default/History
vivaldi-b 1306111 ellen   69uw    REG    8,2   1802240  9965138 /home/ellen/.config/vivaldi/Default/History
vivaldi-b 1306111 ellen  152u     REG    8,2      8720  9965158 /home/ellen/.config/vivaldi/Default/History-journal
# As ever, lots of output skipped for brevity
      

The -p flag shows all the open files for the specified pid. That includes a lot of things like sockets, which I didn't care about. The REG type are "regular files", so I grepped that.

I haven't used lsof much, but I suspect it's usually a better option than directly accessing /proc. Whoever wrote lsof probably knew a lot more about the Linux file system than I do and knew more of the edge cases. But it's handy to know what data lsof is accessing.

Results

As for my search engine experiment, the results aren't interesting yet. Since signing up for the Kagi beta, it seems to be my default search engine. It's possible that's because it's the newest and shiniest search engine. But it also suggests that I haven't had much reason to fall back to Google.

Of course, I could have more easily learned that from the History panels of the various browsers. But it's fun to see how things work behind the UI and I got to play a little with the sqlite3 module.

Useful things

The Python docs all-too-easy-to-miss advice for using the sqlite3 module: Using sqlite3 efficiently.