Timesheet Analysis¶

Turning the CSV Files into Dataframes for Analysis¶

Step 1.1: Open and Read the Project Pulse CSV¶

In [5]:
# Import pandas
import pandas as pd

# Read the CSV file
pp_df = pd.read_csv('./data/project_pulse_tracker.csv')

# Display the output
display(pp_df.head())
   STAGE           STARTED AT           ENDED AT     DURATION (H)  COST ($)   REMARKS
0  Analysis        01-01-1970, 12:00AM  45752.79167  0.500000     13.000000   Planning
1  Analysis        01-01-1970, 12:00AM  45752.98403  0.066667      1.733333   Setup environment
2  Implementation  01-01-1970, 12:00AM  45753.35139  0.933333     24.266667   Setup project
3  Implementation  01-01-1970, 12:00AM  45753.41667  1.566667     40.733333   Add roles to User module
4  Implementation  01-01-1970, 12:00AM  45753.48333  0.883333     22.966667   Backend of Project module completed

Step 1.2: Open and Read the Praise Story CSV¶

In [4]:
# Import pandas
import pandas as pd

# Read the CSV file
ps_df = pd.read_csv('./data/praise_story_tracker.csv')

# Display the output
display(ps_df.head())
   STAGE           STARTED AT   ENDED AT     DURATION (H)  COST ($)   REMARKS
0  Analysis        45783.85278  45783.87500  0.533333      13.866667  Logo
1  Analysis        45784.49097  45784.51736  0.633333      16.466667  Screen Plan
2  Analysis        45784.51736  45784.52292  0.133333       3.466667  Project Plan
3  Implementation  45789.42361  45789.47569  1.250000      32.500000  Testimony migration/Status helper
4  Implementation  45789.82292  45789.91944  2.316667      60.233333  Testimony backend module

Used GenAI to Perform Analyses¶

Prompt: given these 2 csv files, what kind of analyses can I perform on them?

The two CSV files were also uploaded to ChatGPT along with the prompt.

Response¶

The two CSV files — praise_story_tracker.csv and project_pulse_tracker.csv — contain similar structures, tracking the progress of tasks across different stages of two projects. Each file includes:

  • STAGE: Project stage (e.g., Analysis, Implementation)
  • STARTED AT and ENDED AT: Timestamps (Excel date format)
  • DURATION (H): Duration in hours
  • COST ($): Cost associated with the task
  • REMARKS: Descriptive notes
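The STARTED AT and ENDED AT columns hold Excel serial day numbers rather than parsed timestamps. A minimal sketch of converting them to real datetimes, assuming the standard Excel 1900 date system (where pandas treats 1899-12-30 as day zero); the sample serials are taken from the tables above:

```python
import pandas as pd

# Sample ENDED AT values as Excel serial day numbers (assumed 1900 date system)
serials = pd.Series([45752.79167, 45783.85278])

# Convert serial days to datetimes; 1899-12-30 is Excel's day 0 in pandas' convention
dates = pd.to_datetime(serials, unit='D', origin='1899-12-30')
print(dates.dt.date.tolist())
```

Applying this to both trackers before any date-based grouping or plotting would make the time columns directly comparable.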

Here are several analyses you can perform on these datasets:¶

1. Time & Cost Summary¶

  • Total hours and total cost per project.
  • Average hourly rate per stage.
In [12]:
# Import pandas library
import pandas as pd

# Get the total cost of the 2 dfs
total_cost_pp = pp_df['COST ($)'].sum()
total_cost_ps = ps_df['COST ($)'].sum()

# Get the total duration of the 2 dfs
total_duration_pp = pp_df['DURATION (H)'].sum()
total_duration_ps = ps_df['DURATION (H)'].sum()

# Transform the total cost and durations into a df
data = {
    "Project Pulse": {
        "Total Cost ($)": total_cost_pp,
        "Total Duration (H)": total_duration_pp,
    },
    "Praise Story": {
        "Total Cost ($)": total_cost_ps,
        "Total Duration (H)": total_duration_ps,
    }
}

print('-- Summary --')
summary_df = pd.DataFrame(data).T
summary_df
-- Summary --
Out[12]:
               Total Cost ($)  Total Duration (H)
Project Pulse     2259.833333           86.916667
Praise Story      1620.666667           62.333333
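Dividing total cost by total hours gives each project's blended hourly rate, which works out to the same flat $26/h for both. A quick sketch, hard-coding the totals from the summary above for illustration:

```python
# Totals taken from the summary table above (hard-coded for illustration)
totals = {
    "Project Pulse": {"cost": 2259.833333, "hours": 86.916667},
    "Praise Story": {"cost": 1620.666667, "hours": 62.333333},
}

# Blended hourly rate = total cost / total hours
for name, t in totals.items():
    rate = t["cost"] / t["hours"]
    print(f"{name}: ${rate:.2f}/h")
```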
In [13]:
# Import pandas
import pandas as pd

# Get average hourly rate per stage
pp_df['Hourly Rate'] = pp_df['COST ($)'] / pp_df['DURATION (H)']
ps_df['Hourly Rate'] = ps_df['COST ($)'] / ps_df['DURATION (H)']

# Group by Stage and Calculate the Average
avg_pp = pp_df.groupby('STAGE')['Hourly Rate'].mean().rename('Project Pulse')
avg_ps = ps_df.groupby('STAGE')['Hourly Rate'].mean().rename('Praise Story')

comparison_df = pd.concat([avg_pp, avg_ps], axis=1)
print('-- Average Hourly Rate --')
display(comparison_df)
-- Average Hourly Rate --
                Project Pulse  Praise Story
STAGE
Analysis                 26.0          26.0
Implementation           26.0          26.0
Testing                  26.0          26.0
Design                    NaN          26.0

2. Stage Analysis¶

  • Total time and cost spent on each stage (e.g., Analysis vs. Implementation).
  • Compare how much time and money each project allocates to each stage.
In [14]:
# Import libraries
import pandas as pd

# Step 1: Group by Stage
# For Praise Story
praise_stage_summary = ps_df.groupby('STAGE')[['DURATION (H)', 'COST ($)']].sum().rename(
    columns={'DURATION (H)': 'Total Hours (Praise)', 'COST ($)': 'Total Cost (Praise)'})

# For Project Pulse
project_stage_summary = pp_df.groupby('STAGE')[['DURATION (H)', 'COST ($)']].sum().rename(
    columns={'DURATION (H)': 'Total Hours (Pulse)', 'COST ($)': 'Total Cost (Pulse)'})

# Step 2: Combine Results for Comparison
stage_comparison = pd.concat([praise_stage_summary, project_stage_summary], axis=1)
display(stage_comparison)
                Total Hours (Praise)  Total Cost (Praise)  Total Hours (Pulse)  Total Cost (Pulse)
STAGE
Analysis                    1.300000            33.800000             0.566667           14.733333
Design                      6.650000           172.900000                  NaN                  NaN
Implementation             50.100000          1302.600000            72.133333         1875.466667
Testing                     4.283333           111.366667            14.216667          369.633333
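Expressing each stage as a percentage of its project's total hours makes the allocation comparison easier to read than raw sums. A sketch using small illustrative values in the trackers' column layout (the real figures come from the grouped sums above):

```python
import pandas as pd

# Tiny illustrative sample in the trackers' column layout
df = pd.DataFrame({
    'STAGE': ['Analysis', 'Implementation', 'Testing'],
    'DURATION (H)': [1.0, 6.0, 3.0],
})

# Per-stage hours as a share of the project's total hours
hours = df.groupby('STAGE')['DURATION (H)'].sum()
share = (hours / hours.sum() * 100).round(1)
print(share)
```

Running the same two lines against `pp_df` and `ps_df` would show, for example, what fraction of each project went to Implementation.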

3. Trend Analysis¶

  • Plot cumulative cost and time over tasks to see workflow intensity or burnout risk.
  • Identify peaks in workload or cost.
In [16]:
# Import necessary libraries
import matplotlib.pyplot as plt

# Prepare cumulative data
praise_story_df_sorted = ps_df.sort_values(by='STARTED AT').copy()
praise_story_df_sorted['Cumulative Hours'] = praise_story_df_sorted['DURATION (H)'].cumsum()
praise_story_df_sorted['Cumulative Cost'] = praise_story_df_sorted['COST ($)'].cumsum()

project_pulse_df_sorted = pp_df.sort_values(by='STARTED AT').copy()
project_pulse_df_sorted['Cumulative Hours'] = project_pulse_df_sorted['DURATION (H)'].cumsum()
project_pulse_df_sorted['Cumulative Cost'] = project_pulse_df_sorted['COST ($)'].cumsum()

# Plotting
plt.figure(figsize=(12, 6))

# Praise Story plot
plt.plot(praise_story_df_sorted['STARTED AT'], praise_story_df_sorted['Cumulative Hours'], label='Praise Story - Hours', linestyle='-', marker='o')
plt.plot(praise_story_df_sorted['STARTED AT'], praise_story_df_sorted['Cumulative Cost'], label='Praise Story - Cost', linestyle='--', marker='x')

# Project Pulse plot
plt.plot(project_pulse_df_sorted['STARTED AT'], project_pulse_df_sorted['Cumulative Hours'], label='Project Pulse - Hours', linestyle='-', marker='o')
plt.plot(project_pulse_df_sorted['STARTED AT'], project_pulse_df_sorted['Cumulative Cost'], label='Project Pulse - Cost', linestyle='--', marker='x')

plt.title('Cumulative Cost and Hours Over Time')
plt.xlabel('Started At (Excel Date Format)')
plt.ylabel('Cumulative Value')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
[Figure: line plot of cumulative hours and cumulative cost over STARTED AT for Praise Story and Project Pulse]
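The plot can only show peaks visually; to "identify peaks in workload or cost" numerically, `idxmax` on the duration column picks out the single heaviest task. A sketch with illustrative rows in the trackers' layout:

```python
import pandas as pd

# Illustrative rows in the trackers' column layout
df = pd.DataFrame({
    'DURATION (H)': [0.5, 2.3, 1.2],
    'COST ($)': [13.0, 60.2, 31.2],
    'REMARKS': ['Planning', 'Testimony backend module', 'Fix bug'],
})

# idxmax returns the row label of the largest duration; .loc fetches that row
peak = df.loc[df['DURATION (H)'].idxmax()]
print(peak['REMARKS'], peak['DURATION (H)'])
```

The same call on `pp_df` or `ps_df` (or on `COST ($)`) would name the most expensive task in each project.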

4. Comparative Analysis¶

  • Compare stage-wise efficiency between the two projects.
  • Determine which project phase was more cost/time-efficient.
In [17]:
# Import libraries
import pandas as pd

# pp_df and ps_df were loaded from the CSV files in Steps 1.1 and 1.2

# Step 1: Sum total duration and cost per stage
praise_eff = ps_df.groupby('STAGE')[['DURATION (H)', 'COST ($)']].sum()
project_eff = pp_df.groupby('STAGE')[['DURATION (H)', 'COST ($)']].sum()

# Step 2: Compute efficiency (Cost per Hour)
praise_eff['Efficiency (Praise)'] = praise_eff['COST ($)'] / praise_eff['DURATION (H)']
project_eff['Efficiency (Pulse)'] = project_eff['COST ($)'] / project_eff['DURATION (H)']

# Step 3: Combine efficiency columns into one comparison table
efficiency_comparison = pd.concat([
    praise_eff['Efficiency (Praise)'],
    project_eff['Efficiency (Pulse)']
], axis=1)

# Step 4: Print or export
display(efficiency_comparison)
                Efficiency (Praise)  Efficiency (Pulse)
STAGE
Analysis                       26.0                26.0
Design                         26.0                 NaN
Implementation                 26.0                26.0
Testing                        26.0                26.0
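Because the cost per hour is a flat $26 in every stage of both projects, the efficiency table above cannot distinguish them; hours spent per stage is the more telling comparison. A sketch computing the per-stage hour ratio between projects, using illustrative stand-ins for the real stage totals:

```python
import pandas as pd

# Illustrative per-stage hour totals (stand-ins for the real grouped sums)
praise_hours = pd.Series({'Analysis': 1.3, 'Implementation': 50.1, 'Testing': 4.3})
pulse_hours = pd.Series({'Analysis': 0.57, 'Implementation': 72.1, 'Testing': 14.2})

# Ratio > 1 means Praise Story spent more time in that stage than Project Pulse
ratio = (praise_hours / pulse_hours).round(2)
print(ratio)
```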

5. Remarks-based Insights¶

  • Group similar remarks to analyze recurring tasks or bottlenecks.
  • Use keyword analysis on remarks to understand focus areas.
In [20]:
# Import libraries
from collections import Counter
import re
import pandas as pd

# Combine remarks from both projects
combined_remarks = pd.concat([ps_df['REMARKS'], pp_df['REMARKS']])

# Convert to lowercase, remove punctuation, and tokenize
def tokenize_remarks(remarks_series):
    tokens = []
    for remark in remarks_series.dropna():
        words = re.findall(r'\b\w+\b', remark.lower())  # Extract words
        tokens.extend(words)
    return tokens

# Tokenize and count keyword frequency
tokens = tokenize_remarks(combined_remarks)
keyword_counts = Counter(tokens)

# Convert to DataFrame for display
keyword_df = pd.DataFrame(keyword_counts.items(), columns=['Keyword', 'Count'])
keyword_df_sorted = keyword_df.sort_values(by='Count', ascending=False).reset_index(drop=True)

keyword_df_sorted.head(10)
Out[20]:
   Keyword     Count
0  module         29
1  project        13
2  tests          11
3  wip             8
4  testimony       8
5  and             8
6  record          7
7  controller      7
8  backend         7
9  activity        6


Davina's Jupyter Notebooks © Davina Leong, 2025