Excel and SQL

Data can be very valuable, and storing it in Excel is a great way to risk making it worthless. Your file can get deleted, become unusably slow, or subtly accumulate errors.

If you decide to keep a lot of important data in a spreadsheet, there's a good chance you'll come to regret it.

SQL vs. Excel

If Excel isn't a good database, why is it so frequently used as one?
Spreadsheets are effective at storing a small amount of data (a few hundred or a few thousand rows), and it's not obvious what goes wrong when you scale past that - at least until you experience it firsthand. Here's a short list.

  • Data Quality
    Databases are easy to backup. You can restore your data if you make a mistake, and if you need to roll back changes, you can do it without hunting down last Tuesday's manually backed-up copy of your spreadsheet.
    Just as importantly, if you need to work with a lot of data, you can do it with one or two commands. You don't need to worry about skipping the last row or having one incorrect formula out of 1,000.
  • Querying Power
    Do you want to know how many extra-large widgets were produced between 12 and 3 pm on every even-numbered day in March? Or find every country with government debt above 90 percent of GDP?
    You can do it with one query, instead of using a bunch of filters and sorts or a manual selection.
  • Multiple users.
    If you have two people editing data in Excel, you can expect three copies of the final spreadsheet. This is manageable with a little data, but it becomes time-consuming and error prone with more. Databases are designed to handle multiple users.
  • Speed and Storage limits.
    When you have hundreds of rows in a spreadsheet, you can see most of your data in a glance and manipulate it on the fly. By the time you get to ten thousand rows, your spreadsheet is unwieldy. You need to maintain pivot tables or other summaries to keep track of everything, and running calculations on the dataset can easily freeze your computer.
    Even if you're just storing data, Excel has a hard limit of one million rows per sheet. A good database will store and process as much data as your harddrive can hold.

There are a lot of other advantages to using a proper database, but these should be enough to get started.

SQL in Excel

One of the easiest ways to connect Excel to a database is with Python. There are Python libraries for just about any type of database - from SQL Server to MySQL, to NoSQL, Salesforce and Redis.
You can use SQLite if you're just getting started; it's easy to use and comes installed with Python. This post uses DataNitro to connect Excel with Python.

Here's a spreadsheet that keeps track of a stock portfolio. Let's move it into SQL.

A source of data for SQL, in Excel
A stock portfolio

We'll add a cell specifying the database name.

Creating a SQL database from Excel data
Loading the portfolio into a database

Now we'll write a short script to create the database:

import sqlite3

conn = sqlite3.connect(Cell("B1").value + '.db')
c = conn.cursor()

c.execute('''CREATE TABLE ''' + Cell("B1").value +
          '''(ticker text, position integer, purchase_price real)''')

portfolio = []

for cell in Cell("A5").vertical_range:

c.executemany('INSERT INTO portfolio VALUES (?,?,?)', portfolio)

Here, we're reading in the portfolio and adding it to a new table in a new database. For simplicity, we're using Cell B1 ("portfolio") as the name of both the database and the table.

Now we need to read the data back. We'll do this by writing a query in Excel, and executing it with Python:

import sqlite3

conn = sqlite3.connect(Cell("B1").value + '.db')
c = conn.cursor()

c.execute(Cell("B2").value) # fetch data
portfolio = c.fetchall()

CellRange("A5:C300").clear() # cleanup existing data
Cell("A5").table = portfolio

This just cleans up the current data, runs our query, and writes the results to the sheet.

Querying SQL from Excel: "Select * From Portfolio"
Reading it back

As we can see, running it gives us back our data. We can run other queries too:

Querying SQL from Excel: "Select * From
				 Portfolio where ticker='GOOG'"
Getting our position in Google.

Querying SQL from Excel: "Select * From
				 Portfolio where position>0"
Getting all long positions

SQL lets us run sophisticated queries without much work. For a big dataset, this is faster, more flexible, and more reliable than using Excel tables.

One issue with SQL is that many people aren't comfortable working with database queries. We can build out a user-friendly interface to SQL.

Generating SQL queries from Excel input
A graphical SQL interface

We'll add dropdowns and blank cells for users to run queries. Here's a code snippet that'll generate a SQL statement from these dropdowns:

query = "select * from " + Cell("B1").value

where = []

if Cell("B2").value != "*":
    where.append("ticker='" + Cell("B2").value + "'")

if Cell("E2").value != "*":
    where.append("position" + Cell("D2").value + str(Cell("E2").value))

if Cell("H2").value != "*":        
    where.append("purchase_price" + Cell("G2").value + str(Cell("H2").value))

if where:
    # add a where statement
    query += (" where " + ' and '.join(where))

c.execute(query) # fetch data

We can replace the line c.execute(Cell("B2").value) above with this bit of code to generate queries.

Finally, we probably want to be able to add data to our database. We'll add a script that updates the db with any new shares we put in:

import sqlite3

conn = sqlite3.connect(Cell("B1").value + '.db')
c = conn.cursor()

c.execute("select * from " + Cell("B1").value) # fetch data
portfolio = c.fetchall()

updates = []

row = 5 # starting row

while True:
    data = tuple(CellRange((row, 1), (row, 3)).value)
    if data[0]:
        if data not in portfolio: # new record
        row += 1
    else: # end of table

c.executemany('INSERT INTO portfolio VALUES (?,?,?)', updates)

This checks the table for any position that's not already in the database, and adds it. We can use it to add shares in Facebook:

Update SQL from Excel
Adding more data

If we run a search, the database returns our updated portfolio.

Updated database data in Excel
Reading the updated data

And of course, we're not limited to using Excel just to read and write data. For example, we can build a dashboard to show how our portfolio is doing:

Adding Excel functions to SQL query results
Calculating PnL

Here, we're using Python to get current stock prices from Yahoo, and adding some basic calculations and conditional formatting to show the status of our positions and our portfolio.
This is a very basic example of what you can do once Excel is connected to a database. Most importantly, it shows that you can offload data storage and querying to a system that's built for it, and use Excel at what it's best it: analyzing and reporting data.

(You can download the portfolio spreadsheet, and all the code in this post, here.)

Is Excel too slow? Speed it up!