Why Normalize the Data

Normalizing a database greatly reduces redundancy in the data and can save disk space, which is extremely important when storage resources are limited. Performing database normalization will optimize data storage and allow for easier analysis of the meter transaction data locally.

Testing Normalization

All transaction data for one day has been inserted into the database twice: once as a regular table and once within a normalized database model. The query below gets the size of the normalized schema and the non-normalized table (the original data) in bytes and compares the two.

    SELECT NORMALIZED_SCHEMA "Normalized Schema",
           NON_NORMALIZED_TABLE "Non-normalized Table",
           -- cast to NUMERIC so integer division doesn't truncate the percentage to 0
           ROUND(((NON_NORMALIZED_TABLE - NORMALIZED_SCHEMA)::NUMERIC / NON_NORMALIZED_TABLE) * 100, 2) "Difference (%)"
    FROM
        -- total size of all tables in the normalized schema
        (SELECT SUM(PG_RELATION_SIZE(QUOTE_IDENT(SCHEMANAME) || '.' || QUOTE_IDENT(TABLENAME))) NORMALIZED_SCHEMA
         FROM PG_TABLES
         WHERE SCHEMANAME = 'sf_ticket_trans') STAT1,
        -- size of the original, non-normalized table
        (SELECT PG_RELATION_SIZE('public.data_08012022') NON_NORMALIZED_TABLE) STAT2;

Output:

    Normalized Schema    Non-normalized Table    Difference (%)
    6979584              7733248                 9.75

From the above output we can see that normalization has saved almost 10% of the space. While that is not a huge difference at the moment, as data for each additional day is added to the database model, that 10% savings will apply to kilobytes, then megabytes, and so on.

Where Can We Normalize

In order to have a normalized database model, we need to review the dataset's attributes and identify opportunities to reduce redundancy.

The transmission_datetime column is clearly the unique identifier/primary key for the transaction data, as it is 100% unique. The majority of columns have a low percentage of unique values, which makes them clear candidates for normalizing into dimension tables.
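
As a rough illustration, the uniqueness of a column can be measured with a query along these lines; the column and table names come from the original data, but the exact query used here is an assumption:

    -- Percentage of unique values for a few columns of the original table
    SELECT ROUND(COUNT(DISTINCT transmission_datetime)::NUMERIC / COUNT(*) * 100, 2) "transmission_datetime",
           ROUND(COUNT(DISTINCT post_id)::NUMERIC / COUNT(*) * 100, 2) "post_id",
           ROUND(COUNT(DISTINCT session_start_dt)::NUMERIC / COUNT(*) * 100, 2) "session_start_dt"
    FROM public.data_08012022;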

We do, however, have two datetime columns with a high percentage of unique values. This is due to the datetime data type: with a combination of day, hour, minute, and second components, it is easy to end up with several thousand unique values.

However, barely 50% of the total values are unique for a single day, and with the plan to insert data for at least the month of August, we can easily expect to see more redundancy. It will be interesting to see how we attempt to normalize the data.

Normalization Methods

Attribute Normalization Method
post_id Create dimension table for unique meter post IDs
street_block Create dimension table for unique street blocks
session_start_dt
  • split into separate date and time columns
  • reduce redundant unique values by stripping seconds
    from the time and grouping times into 5-minute intervals
  • create dimension table for unique times
session_end_dt
  • split into separate date and time columns
  • reduce redundant unique values by stripping seconds
    from the time and grouping times into 5-minute intervals
  • create dimension table for unique times
gross_paid_amt
  • round the currency float to the nearest .50 to reduce redundant
    unique values and create groupings
  • create dimension table for unique payment amounts
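
As a rough sketch, the datetime and currency transformations above could look like the following in PostgreSQL; the source table and column names come from the original data, but the exact expressions are an assumption, not the actual ETL:

    -- Sketch of the planned transformations
    SELECT session_start_dt::DATE AS session_start_date,
           -- strip seconds and group the time into 5-minute intervals
           (DATE_TRUNC('hour', session_start_dt)
               + (EXTRACT(MINUTE FROM session_start_dt)::INT / 5) * INTERVAL '5 minutes')::TIME
               AS session_start_time,
           -- round the currency amount to the nearest .50
           ROUND(gross_paid_amt::NUMERIC * 2) / 2 AS gross_paid_amt_rounded
    FROM public.data_08012022;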

Model Data

Below is an entity relationship diagram which delineates our fact table and dimension tables, effectively showing how the data will be normalized and how the tables will interact with each other based on the aforementioned normalization.

Database

  • create schema
  • load data
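
A minimal sketch of what these steps might look like, assuming the table and key names implied by the normalization plan above (the actual DDL may differ):

    -- Sketch: create the normalized schema and its tables
    -- (table and key names are assumptions based on the plan above)
    CREATE SCHEMA IF NOT EXISTS sf_ticket_trans;

    CREATE TABLE IF NOT EXISTS sf_ticket_trans.dim_post (
        post_key SERIAL PRIMARY KEY,
        post_id  TEXT UNIQUE NOT NULL
    );

    CREATE TABLE IF NOT EXISTS sf_ticket_trans.dim_street_block (
        street_block_key SERIAL PRIMARY KEY,
        street_block     TEXT UNIQUE NOT NULL
    );

    CREATE TABLE IF NOT EXISTS sf_ticket_trans.dim_time (
        time_key    SERIAL PRIMARY KEY,
        time_of_day TIME UNIQUE NOT NULL  -- 5-minute intervals, seconds stripped
    );

    CREATE TABLE IF NOT EXISTS sf_ticket_trans.dim_payment (
        payment_key    SERIAL PRIMARY KEY,
        gross_paid_amt NUMERIC(8,2) UNIQUE NOT NULL  -- rounded to nearest .50
    );

    CREATE TABLE IF NOT EXISTS sf_ticket_trans.fact_transaction (
        transmission_datetime  TIMESTAMP PRIMARY KEY,
        post_key               INT REFERENCES sf_ticket_trans.dim_post (post_key),
        street_block_key       INT REFERENCES sf_ticket_trans.dim_street_block (street_block_key),
        session_start_date     DATE,
        session_start_time_key INT REFERENCES sf_ticket_trans.dim_time (time_key),
        session_end_date       DATE,
        session_end_time_key   INT REFERENCES sf_ticket_trans.dim_time (time_key),
        payment_key            INT REFERENCES sf_ticket_trans.dim_payment (payment_key)
    );

Loading then becomes a matter of inserting unique values into each dimension table and inserting the transformed transactions into the fact table via foreign-key lookups.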

Execution - Retrieving & Normalizing

The diagram below outlines the entire data process:

Below is a log of the data retrieved via API, by date. Each date is logged once retrieval and insertion into the database are complete.

1,819,348 rows of data for the entire month of August 2022 are now ready to be freely analyzed, with the data stored locally using a minimal amount of disk space.

    Data Date     Rows     Execution Time (minutes)
    2022-08-01    65382    1.89
    2022-08-02    67550    2.97
    2022-08-03    69231    2.59
    2022-08-04    68049    2.38
    2022-08-05    69557    2.69
    2022-08-06    67796    3.01
    2022-08-07     4689    0.24
    2022-08-08    65700    2.60
    2022-08-09    67161    2.37
    2022-08-10    68173    3.23
    2022-08-11    67634    12.37
    2022-08-12    71181    3.86
    2022-08-13    68853    2.33
    2022-08-14     6186    0.23
    2022-08-15    65764    2.66
    2022-08-16    68726    2.77
    2022-08-17    69201    2.82
    2022-08-18    67794    2.80
    2022-08-19    71052    2.95
    2022-08-20    63106    2.54
    2022-08-21     4489    0.19
    2022-08-22    65905    2.71
    2022-08-23    67687    2.67
    2022-08-24    68177    2.79
    2022-08-25    67286    2.65
    2022-08-26    64411    2.55
    2022-08-27    67914    2.75
    2022-08-28      132    0.01
    2022-08-29    51754    1.98
    2022-08-30    62497    2.39
    2022-08-31    66311    2.87
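
As a quick sanity check once loading is complete, the total can be compared against the log; the fact table name here is an assumption, following the sketch schema above:

    -- Assumed fact table name; total should match the 1,819,348 rows logged above
    SELECT COUNT(*) "Total Rows"
    FROM sf_ticket_trans.fact_transaction;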

Future Updates

Python

  • update the script to allow user input for date spans/ranges
  • log the size of all original data combined and compare it to the entire normalized schema
  • create non-materialized database views for data analysis
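
As an example of the last item, a non-materialized view for analysis might look something like this; the view name and underlying tables are hypothetical, following the sketch schema above:

    -- Hypothetical analysis view (a plain, non-materialized view);
    -- table and column names follow the sketch schema, not confirmed DDL
    CREATE OR REPLACE VIEW sf_ticket_trans.v_daily_revenue AS
    SELECT f.session_start_date AS session_date,
           SUM(p.gross_paid_amt) AS total_paid
    FROM sf_ticket_trans.fact_transaction f
    JOIN sf_ticket_trans.dim_payment p USING (payment_key)
    GROUP BY f.session_start_date;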
