✅Connect to Database and Fetch Data
✅Define SQL query to extract required columns with all non-null entries
✅Connect to database and read query result into a DataFrame as df
In a Colab code cell, run this:
```python
import pandas as pd
import matplotlib.pyplot as plt
from datetime import timedelta

# --------------------------
# Connect to Database and Fetch Data
# --------------------------

# Define SQL query to extract required columns with all non-null entries
query = """
SELECT
    job_title,
    job_posting_time,
    date_scraped,
    location
FROM seek_jobs
WHERE job_title IS NOT NULL
  AND job_posting_time IS NOT NULL
  AND date_scraped IS NOT NULL
  AND location IS NOT NULL
"""

# Connect to database and read query result into a DataFrame
conn = connect_to_db(DATABASE_URL)
df = pd.read_sql(query, conn)
```
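The cell above relies on `connect_to_db()` and `DATABASE_URL`, which this section does not define (they presumably come from an earlier setup cell). If you need a stand-in, here is a minimal sketch using Python's built-in `sqlite3`; the placeholder value of `DATABASE_URL` and the choice of `sqlite3` are assumptions, and the real notebook may use SQLAlchemy or psycopg2 instead:

```python
import sqlite3

# Assumed placeholder; substitute your real connection string or file path
DATABASE_URL = "seek_jobs.db"

def connect_to_db(database_url):
    # sqlite3 accepts a plain file path; for Postgres/MySQL-style URLs you
    # would swap this for sqlalchemy.create_engine(database_url).connect()
    return sqlite3.connect(database_url)
```

`pd.read_sql` accepts a `sqlite3` connection directly, so `df = pd.read_sql(query, conn)` works unchanged with this stand-in.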
```
# Sample Output at this step:
# df.head()
# ┌───────────────────────────┬──────────────────┬─────────────────────┬───────────────┐
# │ job_title                 │ job_posting_time │ date_scraped        │ location      │
# ├───────────────────────────┼──────────────────┼─────────────────────┼───────────────┤
# │ Data Analyst              │ 2d ago           │ 2025-04-28 10:00:00 │ Sydney NSW    │
# │ Junior Data Scientist     │ 5h ago           │ 2025-04-28 10:00:00 │ Sydney NSW    │
# │ Machine Learning Engineer │ 1d ago           │ 2025-04-27 10:00:00 │ Melbourne VIC │
# └───────────────────────────┴──────────────────┴─────────────────────┴───────────────┘
```

✅Normalize Job Titles to Identify Data Roles
✅Define keywords to detect Data-related roles
✅Function classify_role() to assign a normalized role label based on presence of keywords
✅Apply classification and filter only Data Analyst-related jobs
In a Colab code cell, run this:
```python
# --------------------------
# Normalize Job Titles to Identify Data Roles
# --------------------------

# Define keywords to detect Data-related roles
role_keywords = {
    "Data Analyst": ["data", "analytics"],
}

# Function to assign a normalized role label based on presence of keywords
def classify_role(title):
    title = title.lower()
    for role, keywords in role_keywords.items():
        if any(keyword in title for keyword in keywords):
            return role
    return "Other"

# Apply classification and filter only Data Analyst-related jobs
df['Role'] = df['job_title'].apply(classify_role)
df = df[df['Role'] == "Data Analyst"]
```
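A quick standalone sanity check of the matcher (definitions repeated so this cell runs on its own). Note that the keywords are deliberately broad: any title containing "data" or "analytics" is labelled "Data Analyst", which is why "Data Scientist" also maps to "Data Analyst" in the sample output:

```python
# Same keyword map and classifier as the cell above
role_keywords = {
    "Data Analyst": ["data", "analytics"],
}

def classify_role(title):
    title = title.lower()
    for role, keywords in role_keywords.items():
        if any(keyword in title for keyword in keywords):
            return role
    return "Other"

print(classify_role("Senior Data Engineer"))      # -> Data Analyst (matches "data")
print(classify_role("Marketing Analytics Lead"))  # -> Data Analyst (matches "analytics")
print(classify_role("Software Engineer"))         # -> Other (no keyword)
```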
```
# Sample Output:
# df[['job_title', 'Role']]
# ┌────────────────┬──────────────┐
# │ job_title      │ Role         │
# ├────────────────┼──────────────┤
# │ Data Analyst   │ Data Analyst │
# │ Data Scientist │ Data Analyst │
# └────────────────┴──────────────┘
```

✅Normalize Location and Filter for Sydney
In a Colab code cell, run this:
```python
# --------------------------
# Normalize Location and Filter for Sydney
# --------------------------

major_cities = ["Sydney NSW"]

# Function to extract and validate major city
def extract_city(location):
    if "," in location:
        city = location.split(",")[-1].strip()
    else:
        city = location.strip()
    return city if city in major_cities else "Other"

# Apply location normalization and keep only Sydney roles
df['clean_location'] = df['location'].apply(extract_city)
df = df[df['clean_location'] == "Sydney NSW"]
```
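The comma handling can be spot-checked in isolation (definitions repeated so this cell is self-contained). Only the text after the last comma is kept, so suburb prefixes like "Parramatta, Sydney NSW" collapse to "Sydney NSW", while anything outside the whitelist becomes "Other":

```python
# Same whitelist and extractor as the cell above
major_cities = ["Sydney NSW"]

def extract_city(location):
    # Keep only the text after the last comma, then validate it
    if "," in location:
        city = location.split(",")[-1].strip()
    else:
        city = location.strip()
    return city if city in major_cities else "Other"

print(extract_city("Parramatta, Sydney NSW"))  # -> Sydney NSW
print(extract_city("Sydney NSW"))              # -> Sydney NSW
print(extract_city("Melbourne VIC"))           # -> Other
```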
```
# Sample Output:
# df[['location', 'clean_location']]
# ┌────────────────────────┬────────────────┐
# │ location               │ clean_location │
# ├────────────────────────┼────────────────┤
# │ Parramatta, Sydney NSW │ Sydney NSW     │
# │ Sydney NSW             │ Sydney NSW     │
# └────────────────────────┴────────────────┘
```

✅Convert "2d ago" or "5h ago" to actual datetime based on scraping time
✅A function estimate_posted_time() takes the scraping date and the posted duration ("2d ago", "3h ago", etc.) and converts it into the actual posting date by subtracting the duration from the scraping date
In a Colab code cell, run this:
```python
# --------------------------
# Estimate Actual Job Posting Date
# --------------------------

# Convert "2d ago" or "5h ago" to actual datetime based on scraping time
def estimate_posted_time(row):
    text = row['job_posting_time'].lower()
    scraped_time = row['date_scraped']
    if "h ago" in text:
        hours = int(''.join(filter(str.isdigit, text)))
        return scraped_time - timedelta(hours=hours)
    elif "d ago" in text:
        days = int(''.join(filter(str.isdigit, text)))
        return scraped_time - timedelta(days=days)
    else:
        return scraped_time  # fallback if format not recognized

df['actual_posted_time'] = df.apply(estimate_posted_time, axis=1)
```
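To see the subtraction at work on a toy frame (the function is repeated so this cell runs standalone; an unrecognised value such as "Just posted" falls back to the scrape time):

```python
import pandas as pd
from datetime import timedelta

def estimate_posted_time(row):
    text = row['job_posting_time'].lower()
    scraped_time = row['date_scraped']
    if "h ago" in text:
        return scraped_time - timedelta(hours=int(''.join(filter(str.isdigit, text))))
    elif "d ago" in text:
        return scraped_time - timedelta(days=int(''.join(filter(str.isdigit, text))))
    else:
        return scraped_time  # fallback if format not recognized

demo = pd.DataFrame({
    'job_posting_time': ["2d ago", "5h ago", "Just posted"],
    'date_scraped': pd.to_datetime(["2025-04-28 10:00:00"] * 3),
})
demo['actual_posted_time'] = demo.apply(estimate_posted_time, axis=1)
print(demo[['job_posting_time', 'actual_posted_time']])
# "2d ago"      -> 2025-04-26 10:00:00
# "5h ago"      -> 2025-04-28 05:00:00
# "Just posted" -> 2025-04-28 10:00:00 (fallback)
```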
```
# Sample Output:
# df[['job_posting_time', 'date_scraped', 'actual_posted_time']]
# ┌──────────────────┬─────────────────────┬─────────────────────┐
# │ job_posting_time │ date_scraped        │ actual_posted_time  │
# ├──────────────────┼─────────────────────┼─────────────────────┤
# │ 2d ago           │ 2025-04-28 10:00:00 │ 2025-04-26 10:00:00 │
# │ 5h ago           │ 2025-04-28 10:00:00 │ 2025-04-28 05:00:00 │
# └──────────────────┴─────────────────────┴─────────────────────┘
```

✅Group posting dates into weeks (start of the week as label)
✅Count job postings per week
✅Assign sequential week numbers and readable labels
In a Colab code cell, run this:
```python
# --------------------------
# Group by Week and Count Postings
# --------------------------

# Group posting dates into weeks (start of the week as label)
df['week_start'] = df['actual_posted_time'].dt.to_period('W').apply(lambda r: r.start_time)

# Count job postings per week
weekly_counts = df.groupby('week_start').size().reset_index(name='num_postings')

# Assign sequential week numbers and readable labels
weekly_counts = weekly_counts.sort_values('week_start').reset_index(drop=True)
weekly_counts['week_num'] = weekly_counts.index + 1
weekly_counts['week_label'] = "Week " + weekly_counts['week_num'].astype(str) + "\n" + \
    weekly_counts['week_start'].dt.strftime("%d %b - ") + \
    (weekly_counts['week_start'] + timedelta(days=6)).dt.strftime("%d %b")
```
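One detail worth knowing: pandas' 'W' frequency defaults to weeks ending on Sunday, so start_time always lands on a Monday, which is what makes the "31 Mar - 06 Apr" style labels line up. A minimal check (assuming only that pandas is installed):

```python
import pandas as pd

# 2025-04-02 is a Wednesday; its weekly period runs Mon 31 Mar - Sun 06 Apr
start = pd.Timestamp("2025-04-02").to_period("W").start_time
print(start)             # -> 2025-03-31 00:00:00
print(start.day_name())  # -> Monday
```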
```
# Sample Output:
# weekly_counts
# ┌────────────┬──────────────┬──────────┬─────────────────────────┐
# │ week_start │ num_postings │ week_num │ week_label              │
# ├────────────┼──────────────┼──────────┼─────────────────────────┤
# │ 2025-03-31 │ 18           │ 1        │ Week 1\n31 Mar - 06 Apr │
# │ 2025-04-07 │ 24           │ 2        │ Week 2\n07 Apr - 13 Apr │
# └────────────┴──────────────┴──────────┴─────────────────────────┘
```

✅Bar chart showing weekly counts
✅Add value labels slightly inside bars to avoid overlap with grid lines
✅Add descriptive title with date range
In a Colab code cell, run this:
```python
# --------------------------
# Plot
# --------------------------

plt.figure(figsize=(14, 6))

# Bar chart showing weekly counts
bars = plt.bar(weekly_counts['week_label'], weekly_counts['num_postings'], color='orchid')

# Add value labels slightly inside bars to avoid overlap with grid lines
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width() / 2, yval - 1, int(yval),
             ha='center', va='top', fontsize=10)

# Add descriptive title with date range
min_date = df['actual_posted_time'].min().strftime("%d %b %Y")
max_date = df['actual_posted_time'].max().strftime("%d %b %Y")

plt.title(f"Weekly Job Postings for Data Roles (Sydney NSW)\n({min_date} – {max_date})", fontsize=14)
plt.xlabel("Week and Date Range")
plt.ylabel("Number of Job Postings")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.show()
```