Initial release — Skool community lesson scraper

Downloads lessons from any Skool classroom to local Markdown files.
Cross-platform (Mac/Windows/Linux), membership-gated, safe to re-run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Kisa 2026-05-04 20:28:20 -04:00
commit 56c7b53254
3 changed files with 301 additions and 0 deletions

README.md

@@ -0,0 +1,95 @@
# skool-lesson-scrape
Download lessons from any Skool community classroom to local Markdown files.
- Works on Mac, Windows, and Linux
- Skips lessons already saved — safe to re-run when new content is added
- Saves one `.md` file per lesson: `Course Name -- Lesson Title.md`
- Respects your membership tier — only downloads content your account can access
- Works great with Obsidian, Notion, or any Markdown-based knowledge system
---
## Requirements
- Python 3.8 or later
- A paid Skool account with access to the community you want to scrape
---
## Setup
**1. Install dependencies**
```bash
pip install -r requirements.txt
playwright install chromium
```
**2. Configure the script**
Open `scrape.py` and edit the two lines at the top of the CONFIG section:
```python
COMMUNITY = "navaigate" # slug from your Skool community URL
OUTPUT_DIR = Path.home() / "skool-lessons" # where .md files are saved
```
- `COMMUNITY`: find it in your Skool URL — `skool.com/your-community-slug`
- `OUTPUT_DIR`: any folder on your machine; created automatically if it doesn't exist

**Obsidian users** — point `OUTPUT_DIR` at a folder inside your vault:
```python
OUTPUT_DIR = Path.home() / "Documents" / "MyVault" / "Lessons"
```
**Windows users** — use a raw string for backslash paths:
```python
OUTPUT_DIR = Path(r"C:\Users\YourName\Documents\skool-lessons")
```
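
If backslashes are a concern, `pathlib` also accepts forward slashes for Windows paths. A minimal sketch: the `YourName` path is a placeholder, and `PureWindowsPath` is used here only so the comparison runs on any operating system.

```python
from pathlib import PureWindowsPath

# Placeholder path for illustration; substitute your own user folder.
raw_style = PureWindowsPath(r"C:\Users\YourName\Documents\skool-lessons")
slash_style = PureWindowsPath("C:/Users/YourName/Documents/skool-lessons")

# Both spellings parse to the same Windows path, so either form can be used
# when setting OUTPUT_DIR on Windows.
same_path = raw_style == slash_style
print(same_path)
```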
---
## Usage
**Full scrape** — downloads all lessons you have access to:
```bash
python scrape.py
```
A browser window will open. Log in to Skool normally (email/password or Google). The script takes over automatically once you land on the community.

**Re-run anytime** — already-saved lessons are skipped automatically.

**Debug mode** — inspect page structure without saving anything:
```bash
python scrape.py --discover
```
---
## How it works
Skool embeds course and lesson structure as JSON in the page source (the `__NEXT_DATA__` script tag). The script reads that JSON to get course and lesson IDs, then navigates to each lesson and extracts the body from Skool's TipTap editor (the `.ProseMirror` element), falling back to broader selectors if that element is missing or nearly empty. Course and lesson discovery therefore needs no fragile DOM scraping; only the lesson body is read from the rendered page.

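
The JSON lookup described above can be sketched as follows. The key names (`props`, `pageProps`, `allCourses`, `metadata`, `hasAccess`) are the ones `scrape.py` reads; the sample HTML and course entries are made up for illustration, and real pages are much larger.

```python
import json
import re

# Hypothetical page source; only the shape of the embedded JSON matters here.
SAMPLE_HTML = """
<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"allCourses": [
  {"name": "abc123", "metadata": {"title": "Getting Started", "hasAccess": 1}},
  {"name": "def456", "metadata": {"title": "Premium Course", "hasAccess": 0}}
]}}}
</script>
</body></html>
"""


def next_data(html: str) -> dict:
    """Return the parsed __NEXT_DATA__ JSON, or {} if the tag is missing."""
    m = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
    return json.loads(m.group(1)) if m else {}


def accessible_courses(html: str) -> list:
    """Keep only the courses whose metadata marks them as accessible."""
    page_props = next_data(html).get("props", {}).get("pageProps", {})
    all_courses = page_props.get("allCourses", [])
    return [c for c in all_courses if c.get("metadata", {}).get("hasAccess", 0)]


titles = [c["metadata"]["title"] for c in accessible_courses(SAMPLE_HTML)]
print(titles)
```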
---
## Notes
- Content is gated by your own Skool membership — you can only download lessons your account has access to
- This tool is for personal offline backup, not redistribution of community content
- Re-running after new lessons are posted will only download what's new
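
The skip-on-re-run behavior comes down to comparing filenames. A minimal sketch, assuming the `Course Name -- Lesson Title.md` pattern described earlier; `save_new_lessons`, the sample lessons, and the temporary folder are hypothetical and exist only for this illustration.

```python
import re
import tempfile
from pathlib import Path


def sanitize(name: str) -> str:
    """Drop characters that are not allowed in filenames on common platforms."""
    return re.sub(r'[<>:"/\\|?*\n\r\t]', '', str(name)).strip().strip(".")[:120]


def lesson_stem(course: str, lesson: str) -> str:
    """Build the filename stem: 'Course Name -- Lesson Title'."""
    return f"{sanitize(course)} -- {sanitize(lesson)}"


def save_new_lessons(out_dir: Path, lessons: list) -> list:
    """Write only lessons whose file is not already in out_dir; return the new stems."""
    existing = {f.stem for f in out_dir.glob("*.md")}
    saved = []
    for course, title, body in lessons:
        stem = lesson_stem(course, title)
        if stem in existing:
            continue  # already on disk from an earlier run
        (out_dir / f"{stem}.md").write_text(f"# {title}\n\n{body}", encoding="utf-8")
        existing.add(stem)
        saved.append(stem)
    return saved


# Hypothetical lessons; the "?" in the second title is stripped from the filename.
lessons = [("Intro", "Welcome", "Hello."), ("Intro", "What's next?", "More soon.")]
with tempfile.TemporaryDirectory() as tmp:
    first_run = save_new_lessons(Path(tmp), lessons)
    second_run = save_new_lessons(Path(tmp), lessons)

print(first_run)   # both lessons are written on the first pass
print(second_run)  # nothing is written on the second pass
```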
---
## Troubleshooting
**"No courses found"** — the page is saved as `classroom.html` in a `skool_scrape_diag` folder inside your system temp folder. The page structure may have changed; open an issue with that HTML file attached.

**Browser closes immediately** — make sure you completed the Playwright browser install: `playwright install chromium`

**Lessons saving as navigation boilerplate** — run `python scrape.py --discover` and open an issue with the printed output and the diagnostic files it saves.

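
To find the diagnostic files on your machine, you can print the folder path. This sketch assumes the folder name `skool_scrape_diag` used by the current `scrape.py`; the name may differ in later versions.

```python
import tempfile
from pathlib import Path

# Same construction scrape.py uses for its diagnostic folder (assumed unchanged).
diag_dir = Path(tempfile.gettempdir()) / "skool_scrape_diag"

print(f"Diagnostic folder: {diag_dir}")
if diag_dir.exists():
    # List whatever has been written there, e.g. classroom.html or lesson.png.
    for item in sorted(diag_dir.iterdir()):
        print(" ", item.name)
else:
    print("Nothing saved yet; the folder is created on the first diagnostic dump.")
```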
---
Built by [Kisa Fenn](https://github.com/kisasttil-gif) — STTIL Solutions

requirements.txt

@@ -0,0 +1,2 @@
playwright>=1.40.0
html2text>=2024.2.26

scrape.py

@@ -0,0 +1,204 @@
#!/usr/bin/env python3
"""
skool-lesson-scrape

Downloads lessons from a Skool community classroom to a local folder as Markdown files.
Skips lessons already saved; safe to re-run when new content is added.

Usage:
    python scrape.py              # full scrape
    python scrape.py --discover   # inspect page structure without saving lessons (debug)

Setup: see README.md
"""
import asyncio
import argparse
import re
import json
import tempfile
import html2text
from pathlib import Path
from playwright.async_api import async_playwright
# ── CONFIG — edit these two lines ────────────────────────────────────────────
#
# COMMUNITY: the slug from your Skool community URL
# e.g. https://www.skool.com/navaigate → "navaigate"
COMMUNITY = "navaigate"
#
# OUTPUT_DIR: folder where .md files are saved (created if it doesn't exist)
# Mac/Linux: Path.home() / "skool-lessons"
# Windows: Path(r"C:\Users\YourName\Documents\skool-lessons")
# Obsidian: Path.home() / "Documents" / "ObsidianVault" / "Lessons"
OUTPUT_DIR = Path.home() / "skool-lessons"
#
# ─────────────────────────────────────────────────────────────────────────────
BASE = "https://www.skool.com"
CLASSROOM = f"{BASE}/{COMMUNITY}/classroom"
DIAG_DIR = Path(tempfile.gettempdir()) / "skool_scrape_diag"
CONTENT_SELECTORS = [
    ".ProseMirror",  # Skool's TipTap editor — primary target
    "[class*='lesson-content']",
    "[class*='lessonContent']",
    "[class*='module-content']",
    "[class*='content-body']",
    "article",
    "main",
]
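

def sanitize(name: str) -> str:
    """Strip characters that are invalid in filenames and cap the length at 120."""
    name = re.sub(r'[<>:"/\\|?*\n\r\t]', '', str(name)).strip().strip(".")
    return name[:120]


def existing_stems() -> set:
    """Return the stems of .md files already in OUTPUT_DIR, creating it if needed."""
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    return {f.stem for f in OUTPUT_DIR.glob("*.md")}


def next_data(html: str) -> dict:
    """Parse the JSON inside the page's __NEXT_DATA__ script tag ({} if absent)."""
    m = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
    return json.loads(m.group(1)) if m else {}


def html_to_md(raw: str) -> str:
    """Convert an HTML fragment to Markdown, keeping links and dropping images."""
    h = html2text.HTML2Text()
    h.body_width = 0  # 0 disables hard line wrapping
    h.ignore_links = False
    h.ignore_images = True
    return h.handle(raw)


def write_lesson(course_title: str, lesson_title: str, body: str) -> str:
    """Write one lesson to OUTPUT_DIR and return the file stem."""
    stem = f"{sanitize(course_title)} -- {sanitize(lesson_title)}"
    out = OUTPUT_DIR / f"{stem}.md"
    out.write_text(f"# {lesson_title}\n\n{body}", encoding="utf-8")
    return stem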


async def lesson_body(page) -> str:
    """Content is rendered client-side into .ProseMirror (Skool's TipTap editor)."""
    # Try selectors from most to least specific; accept the first with real content.
    for sel in CONTENT_SELECTORS:
        el = await page.query_selector(sel)
        if el:
            inner = await el.inner_html()
            if len(inner) > 200:
                return html_to_md(inner)
    # Last resort: convert the whole page body.
    return html_to_md(await page.evaluate("() => document.body.innerHTML"))


async def run(discover: bool = False):
    existing = existing_stems()
    print(f"Output folder: {OUTPUT_DIR}")
    print(f"Lessons already saved: {len(existing)}\n")

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False, slow_mo=25)
        ctx = await browser.new_context(viewport={"width": 1440, "height": 900})
        page = await ctx.new_page()

        print("Opening Skool — please log in when the browser window appears.")
        print("The script will continue automatically once you land on the community.\n")
        await page.goto("https://www.skool.com/login")
        # Wait up to five minutes for the manual login to reach the community.
        await page.wait_for_url(f"**/{COMMUNITY}/**", timeout=300_000)
        print("Logged in.\n")

        await page.goto(CLASSROOM)
        await page.wait_for_load_state("load")
        await asyncio.sleep(3)

        nd = next_data(await page.content())
        all_crses = nd.get("props", {}).get("pageProps", {}).get("allCourses", [])
        courses = [c for c in all_crses if c.get("metadata", {}).get("hasAccess", 0)]
        print(f"Accessible courses: {len(courses)} of {len(all_crses)} total\n")

        if not courses:
            DIAG_DIR.mkdir(parents=True, exist_ok=True)
            # Explicit UTF-8 so the dump does not depend on the platform's default encoding.
            (DIAG_DIR / "classroom.html").write_text(await page.content(), encoding="utf-8")
            print(f"No courses found. Diagnostic HTML saved to {DIAG_DIR}")
            await browser.close()
            return

        if discover:
            # Debug mode: open the first lesson of the first course and dump diagnostics.
            course_url = f"{CLASSROOM}/{courses[0]['name']}"
            await page.goto(course_url)
            await page.wait_for_load_state("load")
            await asyncio.sleep(3)
            cnd = next_data(await page.content())
            children = cnd.get("props", {}).get("pageProps", {}).get("course", {}).get("children", [])
            first = children[0]["course"] if children else None
            if first:
                await page.goto(f"{course_url}?md={first['id']}")
                await page.wait_for_load_state("load")
                await asyncio.sleep(3)
                lpp = next_data(await page.content()).get("props", {}).get("pageProps", {})
                print("Lesson pageProps keys:", list(lpp.keys()))
                DIAG_DIR.mkdir(parents=True, exist_ok=True)
                (DIAG_DIR / "lesson.html").write_text(await page.content(), encoding="utf-8")
                await page.screenshot(path=str(DIAG_DIR / "lesson.png"), full_page=True)
                print(f"Diagnostic files saved to {DIAG_DIR}")
            await browser.close()
            return

        saved = skipped = errors = 0
        for course in courses:
            course_title = course["metadata"]["title"]
            course_url = f"{CLASSROOM}/{course['name']}"
            print(f"Course: {course_title}")

            await page.goto(course_url)
            await page.wait_for_load_state("load")
            await asyncio.sleep(2.5)

            children = (
                next_data(await page.content())
                .get("props", {})
                .get("pageProps", {})
                .get("course", {})
                .get("children", [])
            )
            if not children:
                print(" No lessons found — skipping\n")
                continue
            print(f" {len(children)} lessons")

            for child in children:
                lesson = child.get("course", {})
                lesson_title = lesson.get("metadata", {}).get("title") or lesson.get("name") or "Untitled"
                lesson_id = lesson.get("id", "")
                stem = f"{sanitize(course_title)} -- {sanitize(lesson_title)}"
                if stem in existing:
                    skipped += 1
                    continue
                try:
                    await page.goto(f"{course_url}?md={lesson_id}")
                    await page.wait_for_load_state("load")
                    await asyncio.sleep(2)
                    stem = write_lesson(course_title, lesson_title, await lesson_body(page))
                    existing.add(stem)
                    saved += 1
                    print(f" [saved] {lesson_title[:65]}")
                except Exception as e:
                    errors += 1
                    print(f" [error] {lesson_title[:65]}: {e}")
            print()

        print("-" * 52)
        print(f"Done. Saved: {saved} Skipped: {skipped} Errors: {errors}")
        print(f"Output: {OUTPUT_DIR}")
        await browser.close()


if __name__ == "__main__":
    ap = argparse.ArgumentParser(description="Download Skool community lessons to Markdown")
    ap.add_argument("--discover", action="store_true", help="Debug page structure without saving")
    asyncio.run(run(discover=ap.parse_args().discover))