Status: ✅ COMPLETED - Foundational analytics for Issue #3 code extraction
This analytics suite provides comprehensive word frequency analysis and conversational pattern detection capabilities for ChatGPT and Claude conversation exports. It serves as the foundational research for Issue #3’s code extraction system.
| Script | Purpose | Key Features |
|---|---|---|
| `conversation_context_analyzer.py` | Word frequency & code patterns | 15+ regex patterns, threading analysis, topic emergence |
| `python_word_analyzer.py` | Python-specific conversations | Language detection, technical term extraction |
| `frequency_conversation_finder.py` | Frequency analysis topics | Identifies conversations about frequency analysis |
| `you_vs_your_analyzer.py` | Pronoun usage patterns | Role-based analysis, platform comparison |
| `technical_discussion_analyzer.py` | Technical complexity scoring | Basic technical metrics |
| `improved_technical_analyzer.py` | Enhanced technical metrics | Weighted scoring (code blocks: 10x, functions: 8x) |
| `final_technical_insights.py` | Comprehensive insights | Statistical analysis and correlations |
Output files:

- `conversation_context_analysis.csv` - Word frequency data across all conversations
- `python_conversations_focused.csv` - Python-specific conversation subset
- `frequency_analysis_conversations.csv` - Conversations about frequency analysis
- `you_vs_your_analysis.csv` - Pronoun usage by role and conversation
- `improved_technical_analysis.csv` - Enhanced technical complexity scoring
- `conversation_context_dashboard.html` - Interactive word frequency visualizations
- `conversation_context_report.md` - Comprehensive word frequency analysis
- `conversation_context_performance.md` - Performance metrics and timing
- `most_relevant_frequency_conversation.txt` - Full text of the highest-relevance conversation
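The CSV outputs can be pulled straight into pandas for ad-hoc inspection; a quick example (the exact columns depend on what each analyzer emits):

```python
import pandas as pd

# Load one of the CSV outputs listed above for quick exploration.
df = pd.read_csv("conversation_context_analysis.csv")
print(df.head())
print(df.describe(include="all"))
```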
Install all required dependencies:

```bash
pip install -r requirements.txt
```

Required packages:

- `plotly>=5.0.0` - Interactive visualizations
- `pandas>=1.5.0` - Data manipulation
- `beautifulsoup4>=4.11.0` - HTML parsing (legacy scripts)
- `requests>=2.28.0` - HTTP requests (legacy scripts)
The analytics scripts support flexible file configuration:
Command-line arguments:

```bash
python src/conversation_context_analyzer.py chatgpt.json claude.json
python src/you_vs_your_analyzer.py chatgpt.json claude.json
```

Environment variables:

```bash
export CHATGPT_CONVERSATIONS_FILE=path/to/chatgpt.json
export CLAUDE_CONVERSATIONS_FILE=path/to/claude.json
python src/conversation_context_analyzer.py
```

Default data directory:

```bash
mkdir data/
# Place your files as:
# data/chatgpt_conversations.json
# data/claude_conversations.json
python src/conversation_context_analyzer.py
```
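For orientation, here is a minimal sketch of the precedence these three options imply, assuming command-line arguments win over the environment variables, which in turn fall back to the `data/` defaults; `resolve_input_files` is an illustrative helper, not a function in the scripts:

```python
import os
import sys
from pathlib import Path

def resolve_input_files(argv: list[str]) -> tuple[Path, Path]:
    """Resolve ChatGPT and Claude export paths: CLI args, then env vars, then data/ defaults."""
    if len(argv) >= 3:
        return Path(argv[1]), Path(argv[2])
    chatgpt = os.environ.get("CHATGPT_CONVERSATIONS_FILE", "data/chatgpt_conversations.json")
    claude = os.environ.get("CLAUDE_CONVERSATIONS_FILE", "data/claude_conversations.json")
    return Path(chatgpt), Path(claude)

chatgpt_path, claude_path = resolve_input_files(sys.argv)
```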
Expected input formats:

ChatGPT export format:

```json
[
  {
    "id": "conversation_id",
    "mapping": {
      "node_id": {
        "message": {
          "content": {"parts": ["message text"]},
          "create_time": 1234567890,
          "author": {"role": "user"}
        }
      }
    }
  }
]
```

Claude export format:

```json
[
  {
    "uuid": "conversation_id",
    "created_at": "2024-01-01T00:00:00Z",
    "chat_messages": [
      {
        "text": "message text",
        "sender": "human",
        "created_at": "2024-01-01T00:00:00Z"
      }
    ]
  }
]
```
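Both schemas can be flattened into a common message structure before analysis; the following is a minimal sketch under that assumption (the `normalize_*` helpers are illustrative, not functions provided by the suite):

```python
def normalize_chatgpt(conversations: list[dict]) -> list[dict]:
    """Flatten ChatGPT's mapping-of-nodes layout into a simple message list."""
    messages = []
    for conv in conversations:
        for node in conv.get("mapping", {}).values():
            msg = node.get("message")
            if not msg:
                continue
            parts = msg.get("content", {}).get("parts", [])
            messages.append({
                "conversation_id": conv.get("id"),
                "role": msg.get("author", {}).get("role"),
                "text": " ".join(p for p in parts if isinstance(p, str)),
                "created": msg.get("create_time"),
            })
    return messages

def normalize_claude(conversations: list[dict]) -> list[dict]:
    """Flatten Claude's chat_messages layout into the same message structure."""
    messages = []
    for conv in conversations:
        for msg in conv.get("chat_messages", []):
            messages.append({
                "conversation_id": conv.get("uuid"),
                "role": msg.get("sender"),
                "text": msg.get("text", ""),
                "created": msg.get("created_at"),
            })
    return messages
```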
Run in this recommended order for full analysis:
```bash
# 1. Core context analysis
python src/conversation_context_analyzer.py

# 2. Python-specific analysis
python src/python_word_analyzer.py

# 3. Frequency analysis identification
python src/frequency_conversation_finder.py

# 4. Pronoun usage analysis
python src/you_vs_your_analyzer.py

# 5. Technical complexity analysis
python src/technical_discussion_analyzer.py
python src/improved_technical_analyzer.py

# 6. Final insights generation
python src/final_technical_insights.py
```
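If a single entry point is preferred, the same order can be scripted; a minimal sketch using subprocess (a hypothetical `run_pipeline.py`, not part of the repository):

```python
import subprocess
import sys

# Scripts in the recommended execution order.
PIPELINE = [
    "src/conversation_context_analyzer.py",
    "src/python_word_analyzer.py",
    "src/frequency_conversation_finder.py",
    "src/you_vs_your_analyzer.py",
    "src/technical_discussion_analyzer.py",
    "src/improved_technical_analyzer.py",
    "src/final_technical_insights.py",
]

for script in PIPELINE:
    print(f"Running {script} ...")
    subprocess.run([sys.executable, script], check=True)
```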
Test the analytics pipeline:

```bash
python src/analytics_validator.py
```

This validator checks:

- ✅ All modules import successfully
- ✅ Required dependencies are available
- ✅ Basic functionality works
- ✅ Expected output files exist
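Conceptually, those checks amount to something like the sketch below; this is not the actual `analytics_validator.py` implementation, and the package and file lists are illustrative:

```python
import importlib
from pathlib import Path

REQUIRED_PACKAGES = ["plotly", "pandas", "bs4", "requests"]  # beautifulsoup4 installs as bs4
EXPECTED_OUTPUTS = [
    "conversation_context_analysis.csv",
    "conversation_context_dashboard.html",
    "conversation_context_report.md",
]

def check_dependencies() -> bool:
    """Verify that each required package can be imported."""
    ok = True
    for name in REQUIRED_PACKAGES:
        try:
            importlib.import_module(name)
        except ImportError:
            print(f"Missing dependency: {name}")
            ok = False
    return ok

def check_outputs() -> bool:
    """Verify that the expected output files exist on disk."""
    missing = [f for f in EXPECTED_OUTPUTS if not Path(f).exists()]
    for f in missing:
        print(f"Missing output: {f}")
    return not missing

if __name__ == "__main__":
    print("OK" if check_dependencies() and check_outputs() else "FAILED")
```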
This analytics suite directly enables Issue #3 (Code Extraction) by providing:
```python
# Technical complexity scoring for prioritization
technical_score = (
    code_blocks * 10 + function_defs * 8 + imports * 6 +
    assignments * 3 + file_refs * 5 + cli_commands * 4
) / word_count * 1000
```

The analytics suite processes:

- 2,254 conversations analyzed across 27 months
- 50K+ messages processed with streaming efficiency
- 15+ technical patterns identified with weighted scoring
- Memory-efficient processing using generators for large datasets
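As an illustration of how such a score can be computed from raw message text, the sketch below uses placeholder regexes (not the suite's actual 15+ patterns) with the weights from the formula above:

```python
import re

# Illustrative patterns only; the real analyzer uses a larger, more precise set.
PATTERNS = {
    "code_blocks": (re.compile(r"```"), 10),        # counts fence markers (open/close)
    "function_defs": (re.compile(r"\bdef \w+\("), 8),
    "imports": (re.compile(r"^\s*(?:import|from)\s+\w+", re.MULTILINE), 6),
    "file_refs": (re.compile(r"\b\w+\.(?:py|json|csv|md)\b"), 5),
    "cli_commands": (re.compile(r"(?:^|\n)\s*(?:pip|python|git)\s+\w+"), 4),
    "assignments": (re.compile(r"\b\w+\s*=\s*\S+"), 3),
}

def technical_score(text: str) -> float:
    """Weighted pattern counts per 1000 words, mirroring the formula above."""
    word_count = max(len(text.split()), 1)
    weighted = sum(len(pattern.findall(text)) * weight for pattern, weight in PATTERNS.values())
    return weighted / word_count * 1000
```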
```python
from typing import Any, Dict, Generator

def generate_timeline_streaming(self) -> Generator[Dict[str, Any], None, None]:
    # Memory-efficient processing for 1000+ conversations
    for conv in self.load_data_streaming():
        yield self.process_single_conversation(conv)
```

```python
def prioritize_code_extraction(conversations):
    # Priority extraction based on analytics findings:
    # 1. High technical complexity score (>400)
    # 2. Strong "your" preference (indicates user code)
    # 3. Python/web development conversations
    # 4. ChatGPT platform (more possessive code references)
    ...
```
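A minimal sketch of how those priorities could drive an ordering, assuming per-conversation fields such as `technical_score`, `your_count`, `you_count`, `topics`, and `platform` are available from the CSV outputs (all field names here are hypothetical):

```python
def prioritize_code_extraction(conversations: list[dict]) -> list[dict]:
    """Order conversations by the priority signals listed above (highest priority first)."""
    def rank(conv: dict) -> tuple:
        high_complexity = conv.get("technical_score", 0) > 400
        your_preference = conv.get("your_count", 0) > conv.get("you_count", 0)
        relevant_topic = any(t in conv.get("topics", []) for t in ("python", "web development"))
        chatgpt_platform = conv.get("platform") == "chatgpt"
        return (high_complexity, your_preference, relevant_topic, chatgpt_platform,
                conv.get("technical_score", 0))

    return sorted(conversations, key=rank, reverse=True)
```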
Issue Status: ✅ COMPLETED - Ready for Issue #3 code extraction implementation

This analytics foundation provides the conversation filtering, quality scoring, and pattern detection capabilities needed for effective code extraction from AI conversation data.