SQLite gives us the ability to create a simple relational database model, which can be perfect for quickly getting data projects off the ground. However, for any long-standing project, infrastructure needs to scale with demand and with the data itself. The tables in your model will grow over time, and so will your data management needs. Enter PostgreSQL! Postgres is a relational database management system that offers many features.
In this project, I migrate an entire SQLite database into PostgreSQL and explore the use of views, a common but powerful feature.
Database Migration
Load SQLite database into Python
Load the .sqlite database file into the Python environment using the sqlite3 library.
import sqlite3
import pandas as pd

# connect to the sqlite db file
sqlite_conn = sqlite3.connect("../sqlite/database.sqlite")
sqlite_cur = sqlite_conn.cursor()

# query all table names from the sqlite schema
tables_df = pd.DataFrame(sqlite_cur.execute("SELECT name FROM sqlite_master WHERE type='table'"))
tables_df.columns = [col[0] for col in sqlite_cur.description]  # column headers from the cursor description
Tables in the SQLite database
Query all data into dataframes
Query each table from the loaded data model and convert the results into dataframes.
# create a dictionary keyed by table name that will hold each table's data
df_dict = {'reviews': None, 'artists': None, 'genres': None,
           'labels': None, 'years': None, 'content': None}

for i in list(df_dict.keys()):  # iterate through the table names
    df_dict[i] = pd.DataFrame(sqlite_cur.execute(f'SELECT * FROM {i}'))  # query each table into a dataframe
    df_dict[i].columns = [col[0] for col in sqlite_cur.description]  # get header/columns from the cursor description
Reviewing a sample of the dataframes
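Before pushing anything to Postgres, a quick preview confirms each table came across intact. A minimal sketch, reusing the df_dict built above:
# preview each dataframe's shape and first few rows as a sanity check
for name, df in df_dict.items():
    print(f"{name}: {df.shape[0]} rows x {df.shape[1]} columns")
    print(df.head(3))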
Load dataframes into Postgres database as tables
Use the pandas "to_sql" method to create tables and load data into the database via the defined database connection.
from sqlalchemy import create_engine
import config  # local module holding the database credentials

# create a sqlalchemy connection string for the newly created database
database = f"postgresql+psycopg2://{config.db_user}:{config.db_password}@localhost:5432/pitchfork?gssencmode=disable"
engine = create_engine(database)

for i in list(df_dict.keys()):  # iterate through the dictionary that holds the dataframes
    df_dict[i].to_sql(name=i, index=False, con=engine, if_exists='replace', chunksize=100000)  # load data into the database
    print(f"Table {i} loaded")
Database Design: Non-materialized Views
With the data now fully loaded, we can query it directly through the database management system and use tools native to the environment, such as ERD diagrams, which provide an easy overhead view of the tables and their columns:
Database generated ERD Diagram
With confirmation that our data has been loaded in the same format as it was held in the SQLite db file, we can now create some views.
Creating Views
-- create a view for reviews with highscores
CREATE VIEW high_scores AS
SELECT * FROM reviews
WHERE reviews.score > 9
Querying "high_scores" View
-- create a view for top genres by review count
CREATE VIEW top_genres_by_reviews AS
SELECT genres.genre AS "Genre", COUNT(reviews.reviewid) AS "Review Count"
FROM genres
LEFT JOIN reviews
ON genres.reviewid = reviews.reviewid
WHERE genres.genre IS NOT NULL
GROUP BY genres.genre
ORDER BY COUNT(reviews.reviewid) DESC
Querying "top_genres_by_reviews" View
-- create a view for the top 10 artists by review count for each year
CREATE OR REPLACE VIEW ranked_artists_y AS
WITH ranked_artists AS (
    SELECT pub_year, artist, COUNT(reviewid) AS "Review Count",
        ROW_NUMBER() OVER (
            PARTITION BY pub_year
            ORDER BY COUNT(reviewid) DESC) AS row_num
    FROM reviews
    GROUP BY pub_year, artist)
SELECT pub_year, artist, "Review Count", row_num
FROM ranked_artists
WHERE row_num BETWEEN 1 AND 10
Querying "ranked_artists_y" View
With the views created, it's easy to gain insights and answer analytical questions as more data is added.