✅Connect to Database and Fetch Data
✅Define SQL query to extract required columns with all non-null entries
✅Connect to database and read query result into a DataFrame as df
In a Colab code cell, run this:
```python
import pandas as pd
import matplotlib.pyplot as plt
from datetime import timedelta

# --------------------------
# Connect to Database and Fetch Data
# --------------------------

# Define SQL query to extract required columns with all non-null entries
query = """
SELECT
    job_title,
    job_posting_time,
    date_scraped,
    location
FROM seek_jobs
WHERE job_title IS NOT NULL
  AND job_posting_time IS NOT NULL
  AND date_scraped IS NOT NULL
  AND location IS NOT NULL
"""

# Connect to database and read query result into a DataFrame
conn = connect_to_db(DATABASE_URL)
df = pd.read_sql(query, conn)
```
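The cell above relies on `connect_to_db()` and `DATABASE_URL`, which this section does not define (they presumably come from an earlier setup cell). If you need a stand-in, here is a minimal sketch using Python's built-in `sqlite3`; the placeholder value of `DATABASE_URL` and the choice of `sqlite3` are assumptions, and the real notebook may use SQLAlchemy or psycopg2 instead:

```python
import sqlite3

# Assumed placeholder; substitute your real connection string or file path
DATABASE_URL = "seek_jobs.db"

def connect_to_db(database_url):
    # sqlite3 accepts a plain file path; for Postgres/MySQL-style URLs you
    # would swap this for sqlalchemy.create_engine(database_url).connect()
    return sqlite3.connect(database_url)
```

`pd.read_sql` accepts a `sqlite3` connection directly, so `df = pd.read_sql(query, conn)` works unchanged with this stand-in.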
```
# Sample Output at this step:
# df.head()
# ┌───────────────────────────┬──────────────────┬─────────────────────┬───────────────┐
# │ job_title                 │ job_posting_time │ date_scraped        │ location      │
# ├───────────────────────────┼──────────────────┼─────────────────────┼───────────────┤
# │ Data Analyst              │ 2d ago           │ 2025-04-28 10:00:00 │ Sydney NSW    │
# │ Junior Data Scientist     │ 5h ago           │ 2025-04-28 10:00:00 │ Sydney NSW    │
# │ Machine Learning Engineer │ 1d ago           │ 2025-04-27 10:00:00 │ Melbourne VIC │
# └───────────────────────────┴──────────────────┴─────────────────────┴───────────────┘
```

✅Normalize Job Titles to Identify Data Roles
✅Define keywords to detect Data-related roles
✅Function classify_role() to assign a normalized role label based on presence of keywords
✅Apply classification and filter only Data Analyst-related jobs
In a Colab code cell, run this:
```python
# --------------------------
# Normalize Job Titles to Identify Data Roles
# --------------------------

# Define keywords to detect Data-related roles
role_keywords = {
    "Data Analyst": ["data", "analytics"],
}

# Function to assign a normalized role label based on presence of keywords
def classify_role(title):
    title = title.lower()
    for role, keywords in role_keywords.items():
        if any(keyword in title for keyword in keywords):
            return role
    return "Other"

# Apply classification and filter only Data Analyst-related jobs
df['Role'] = df['job_title'].apply(classify_role)
df = df[df['Role'] == "Data Analyst"]
```
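A quick standalone sanity check of the matcher (definitions repeated so this cell runs on its own). Note that the keywords are deliberately broad: any title containing "data" or "analytics" is labelled "Data Analyst", which is why "Data Scientist" also maps to "Data Analyst" in the sample output:

```python
# Same keyword map and classifier as the cell above
role_keywords = {
    "Data Analyst": ["data", "analytics"],
}

def classify_role(title):
    title = title.lower()
    for role, keywords in role_keywords.items():
        if any(keyword in title for keyword in keywords):
            return role
    return "Other"

print(classify_role("Senior Data Engineer"))      # -> Data Analyst (matches "data")
print(classify_role("Marketing Analytics Lead"))  # -> Data Analyst (matches "analytics")
print(classify_role("Software Engineer"))         # -> Other (no keyword)
```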
```
# Sample Output:
# df[['job_title', 'Role']]
# ┌────────────────┬──────────────┐
# │ job_title      │ Role         │
# ├────────────────┼──────────────┤
# │ Data Analyst   │ Data Analyst │
# │ Data Scientist │ Data Analyst │
# └────────────────┴──────────────┘
```

✅Normalize Location and Filter for Sydney
In a Colab code cell, run this:
```python
# --------------------------
# Normalize Location and Filter for Sydney
# --------------------------

major_cities = ["Sydney NSW"]

# Function to extract and validate major city
def extract_city(location):
    if "," in location:
        city = location.split(",")[-1].strip()
    else:
        city = location.strip()
    return city if city in major_cities else "Other"

# Apply location normalization and keep only Sydney roles
df['clean_location'] = df['location'].apply(extract_city)
df = df[df['clean_location'] == "Sydney NSW"]
```
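The comma handling can be spot-checked in isolation (definitions repeated so this cell is self-contained). Only the text after the last comma is kept, so suburb prefixes like "Parramatta, Sydney NSW" collapse to "Sydney NSW", while anything outside the whitelist becomes "Other":

```python
# Same whitelist and extractor as the cell above
major_cities = ["Sydney NSW"]

def extract_city(location):
    # Keep only the text after the last comma, then validate it
    if "," in location:
        city = location.split(",")[-1].strip()
    else:
        city = location.strip()
    return city if city in major_cities else "Other"

print(extract_city("Parramatta, Sydney NSW"))  # -> Sydney NSW
print(extract_city("Sydney NSW"))              # -> Sydney NSW
print(extract_city("Melbourne VIC"))           # -> Other
```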
```
# Sample Output:
# df[['location', 'clean_location']]
# ┌────────────────────────┬────────────────┐
# │ location               │ clean_location │
# ├────────────────────────┼────────────────┤
# │ Parramatta, Sydney NSW │ Sydney NSW     │
# │ Sydney NSW             │ Sydney NSW     │
# └────────────────────────┴────────────────┘
```

✅Convert "2d ago" or "5h ago" to actual datetime based on scraping time
✅A function estimate_posted_time() takes the scraping date and the posted duration ("2d ago", "3h ago", etc.) and converts it into the actual posting date by subtracting the duration from the scraping date
In a Colab code cell, run this:
```python
# --------------------------
# Estimate Actual Job Posting Date
# --------------------------

# Convert "2d ago" or "5h ago" to actual datetime based on scraping time
def estimate_posted_time(row):
    text = row['job_posting_time'].lower()
    scraped_time = row['date_scraped']
    if "h ago" in text:
        hours = int(''.join(filter(str.isdigit, text)))
        return scraped_time - timedelta(hours=hours)
    elif "d ago" in text:
        days = int(''.join(filter(str.isdigit, text)))
        return scraped_time - timedelta(days=days)
    else:
        return scraped_time  # fallback if format not recognized

df['actual_posted_time'] = df.apply(estimate_posted_time, axis=1)
```
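To see the subtraction at work on a toy frame (the function is repeated so this cell runs standalone; an unrecognised value such as "Just posted" falls back to the scrape time):

```python
import pandas as pd
from datetime import timedelta

def estimate_posted_time(row):
    text = row['job_posting_time'].lower()
    scraped_time = row['date_scraped']
    if "h ago" in text:
        return scraped_time - timedelta(hours=int(''.join(filter(str.isdigit, text))))
    elif "d ago" in text:
        return scraped_time - timedelta(days=int(''.join(filter(str.isdigit, text))))
    else:
        return scraped_time  # fallback if format not recognized

demo = pd.DataFrame({
    'job_posting_time': ["2d ago", "5h ago", "Just posted"],
    'date_scraped': pd.to_datetime(["2025-04-28 10:00:00"] * 3),
})
demo['actual_posted_time'] = demo.apply(estimate_posted_time, axis=1)
print(demo[['job_posting_time', 'actual_posted_time']])
# "2d ago"      -> 2025-04-26 10:00:00
# "5h ago"      -> 2025-04-28 05:00:00
# "Just posted" -> 2025-04-28 10:00:00 (fallback)
```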
```
# Sample Output:
# df[['job_posting_time', 'date_scraped', 'actual_posted_time']]
# ┌──────────────────┬─────────────────────┬─────────────────────┐
# │ job_posting_time │ date_scraped        │ actual_posted_time  │
# ├──────────────────┼─────────────────────┼─────────────────────┤
# │ 2d ago           │ 2025-04-28 10:00:00 │ 2025-04-26 10:00:00 │
# │ 5h ago           │ 2025-04-28 10:00:00 │ 2025-04-28 05:00:00 │
# └──────────────────┴─────────────────────┴─────────────────────┘
```

✅Group posting dates into weeks (start of the week as label)
✅Count job postings per week
✅Assign sequential week numbers and readable labels
In a Colab code cell, run this:
```python
# --------------------------
# Group by Week and Count Postings
# --------------------------

# Group posting dates into weeks (start of the week as label)
df['week_start'] = df['actual_posted_time'].dt.to_period('W').apply(lambda r: r.start_time)

# Count job postings per week
weekly_counts = df.groupby('week_start').size().reset_index(name='num_postings')

# Assign sequential week numbers and readable labels
weekly_counts = weekly_counts.sort_values('week_start').reset_index(drop=True)
weekly_counts['week_num'] = weekly_counts.index + 1
weekly_counts['week_label'] = "Week " + weekly_counts['week_num'].astype(str) + "\n" + \
    weekly_counts['week_start'].dt.strftime("%d %b - ") + \
    (weekly_counts['week_start'] + timedelta(days=6)).dt.strftime("%d %b")
```
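One detail worth knowing: pandas' 'W' frequency defaults to weeks ending on Sunday, so start_time always lands on a Monday, which is what makes the "31 Mar - 06 Apr" style labels line up. A minimal check (assuming only that pandas is installed):

```python
import pandas as pd

# 2025-04-02 is a Wednesday; its weekly period runs Mon 31 Mar - Sun 06 Apr
start = pd.Timestamp("2025-04-02").to_period("W").start_time
print(start)             # -> 2025-03-31 00:00:00
print(start.day_name())  # -> Monday
```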
```
# Sample Output:
# weekly_counts
# ┌────────────┬──────────────┬──────────┬─────────────────────────┐
# │ week_start │ num_postings │ week_num │ week_label              │
# ├────────────┼──────────────┼──────────┼─────────────────────────┤
# │ 2025-03-31 │ 18           │ 1        │ Week 1\n31 Mar - 06 Apr │
# │ 2025-04-07 │ 24           │ 2        │ Week 2\n07 Apr - 13 Apr │
# └────────────┴──────────────┴──────────┴─────────────────────────┘
```

✅Bar chart showing weekly counts
✅Add value labels slightly inside bars to avoid overlap with grid lines
✅Add descriptive title with date range
In a Colab code cell, run this:
```python
# --------------------------
# Plot
# --------------------------

plt.figure(figsize=(14, 6))

# Bar chart showing weekly counts
bars = plt.bar(weekly_counts['week_label'], weekly_counts['num_postings'], color='orchid')

# Add value labels slightly inside bars to avoid overlap with grid lines
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width() / 2, yval - 1, int(yval),
             ha='center', va='top', fontsize=10)

# Add descriptive title with date range
min_date = df['actual_posted_time'].min().strftime("%d %b %Y")
max_date = df['actual_posted_time'].max().strftime("%d %b %Y")

plt.title(f"Weekly Job Postings for Data Roles (Sydney NSW)\n({min_date} – {max_date})", fontsize=14)
plt.xlabel("Week and Date Range")
plt.ylabel("Number of Job Postings")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.show()
```