@sitely/framework
@sitely/framework is the package authors import. It exports defineSite and defineSegment (the builder entry points), urlPattern, presence, asset, the TTL preset constants, ctx.lazy, the framework error hierarchy, the runtime type surface, the build and test-pkg subsystems, and the sitely CLI. If a piece of behaviour spans author-time and build-time, it lives here.
This page covers the top-level modules under packages/framework/src/. The build pipeline and the in-process test runner each get their own deep dives.
The DSL — builder pattern
A site definition is assembled by chaining off defineSite({...}). Each step accumulates type information into the next, so cross-references (derivedFrom, extract's return keys, provides … the implicit ones) are type-safe.
import { defineSite, urlPattern, presence, asset, TTL, RateLimitedError } from "@sitely/framework";
import { Article } from "@sitely/schemas";
import { z } from "zod";
import pkg from "../package.json" with { type: "json" };
const articleUrl = urlPattern("https://blog.example.com/article/:id");
const commentsUrl = urlPattern("https://blog.example.com/api/comments/:id");
const PostArticle = z.object({
"@type": z.literal("Article"),
headline: z.string(),
body: z.string(),
author: presence(z.string().nullable(), 0.9),
heroImage: presence(asset("image"), 0.7),
});
const CommentList = z.object({
items: z.array(z.object({
author: z.string(),
body: z.string(),
postedAt: z.string().datetime(),
})),
});
export default defineSite({
site: { id: "blog", displayName: "Example Blog", version: pkg.version },
origins: [{ hostname: "blog.example.com" }],
rateLimit: { maxConcurrent: 2, requestsPerSecond: 1 },
})
.checkResponse((response) => {
if (response.includes("Something went wrong")) {
throw new RateLimitedError({ retryAfter: 60_000 });
}
})
.resource("article", {
schema: PostArticle,
url: articleUrl,
ttl: TTL.daily,
})
.resource("comments", {
schema: CommentList,
derivedFrom: "article",
extract: async (ctx, article) => {
const r = await ctx.fetch(commentsUrl.toUrl({ id: ctx.params.id }));
return await r.json();
},
ttl: TTL.short,
})
.page(articleUrl, {
validate: (ctx) => ctx.$("article").exists(),
extract: async (ctx) => ({
article: {
"@type": () => "Article" as const,
headline: () => ctx.$("h1").text(),
body: () => ctx.$("article .content").text(),
author: () => ctx.$(".byline").text(),
heroImage: () => ctx.$('meta[property="og:image"]').attr("content"),
},
}),
fixtures: [
{ params: { id: "hello-world" } },
{ params: { id: "not-found" }, errorCase: true },
],
})
.build();
That's the whole authoring surface in one example. Each section below decomposes one piece.
defineSite({ site, origins, rateLimit })
The builder entry. Takes the always-required header: the site's identity, its origins, and its outbound rate limit. Everything else is added via chained methods.
export function defineSite<TConfig extends SiteHeader>(config: TConfig): SiteBuilder<TConfig, {}, {}>;
interface SiteHeader {
site: { id: string; displayName: string; version: string; homepage?: string };
origins: Origin[];
rateLimit: RateLimitConfig;
locales?: LocaleConfig;
family?: FamilyConfig;
crawl?: CrawlConfig;
framework?: { minVersion?: string; maxVersion?: string };
normalizeUrl?: (url: string) => string;
}
site.version is injected by sitely build from package.json. Authors don't write it by hand; the import-from-package.json pattern in the example is the recommended shape so type-checking still sees the version field.
.resource(name, def)
Registers a typed resource. The accumulated builder type tracks TResources so subsequent calls can reference the resource by name (e.g. derivedFrom: "article"):
interface ResourceDef<TSchema, TUrlPattern> {
schema: TSchema; // Standard Schema validator
url?: TUrlPattern; // URLPattern, mutually exclusive with derivedFrom
derivedFrom?: keyof TResources; // typed against accumulated resources
extract?: ExtractFn; // required when derivedFrom is set
ttl: ResourceTTL; // see TTL presets
}
Two variants:
- URL-bound resource (url): the framework fetches the URL via url.toUrl(params), runs the matching page's validate + extract, and caches the result.
- Derived resource (derivedFrom): the framework fetches the parent resource first, then runs this resource's extract with the parent's data as the second argument. Useful for resources that don't have their own URL.
You can't declare both url and derivedFrom. The build rejects packages that try.
.page(urlPattern, def)
Declares a page — a URL pattern with validate, extract, optional paginate, and inline fixtures. The page's URL pattern is also its identifier in the manifest:
interface PageDef<TParams, TResources> {
validate: (ctx: ExtractContext<TParams>) => boolean;
extract: (ctx: ExtractContext<TParams>) => ExtractReturn<TResources>;
paginate?: { next: (ctx: ExtractContext<TParams>) => string | null };
fixtures: FixtureSpec<TParams>[];
}
type ExtractReturn<TResources> = Partial<{
[K in keyof TResources]: { [F in keyof ResourceOutput<TResources[K]>]: () => ResourceOutput<TResources[K]>[F] };
}>;
The ExtractReturn constraint is what gives the framework provides for free: the keys actually returned by extract are the resources the page provides. sitely build dry-runs each page's extract on a fixture to populate the manifest's per-page provides list.
Each leaf value in extract's return is a field function — a zero-argument function that produces the value. The framework calls each in turn, catches per-field errors, and validates the assembled object against the resource's schema.
.checkResponse(fn)
Optional site-level smoke test that runs before per-page validate / extract. Receives a response snapshot; throws a framework error to signal bad responses:
.checkResponse((response) => {
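if (response.status === 429 || response.headers["retry-after"]) {
// Retry-After may be absent on a 429 or be an HTTP date; only pass a numeric value through.
const seconds = Number(response.headers["retry-after"]);
throw new RateLimitedError(Number.isFinite(seconds) ? { retryAfter: seconds * 1000 } : {});
}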
if (response.has(".captcha-challenge")) {
throw new CaptchaError();
}
if (response.includes("This page has been removed")) {
throw new PermanentError({ reason: "content removed" });
}
})
The author's checkResponse runs after the framework's built-in CAPTCHA detection; the well-known anti-bot services are caught for you. See The framework error hierarchy below.
Built-in CAPTCHA detection
A small auto-detection step runs before the author's checkResponse. It inspects the response's headers, cookies, and body for known signatures of common anti-bot services. On a match, the framework throws CaptchaError({ service }) and the author's checkResponse doesn't run.
The detected services and the signals each one matches on:
| Service | Header signals | Cookie signals | Body signals |
|---|---|---|---|
| cloudflare | cf-mitigated: challenge, server: cloudflare + cf-ray | __cf_bm, cf_clearance | <title>Just a moment..., /cdn-cgi/challenge-platform/ |
| datadome | x-datadome-cid, x-dd-b | datadome | <title>You have been blocked, dd_cookie_test_ |
| perimeterx | x-px-block, x-px-action | _px3 | <script src="//client.perimeterx.net/ |
| incapsula | x-iinfo, x-cdn: Incapsula | visid_incap_*, incap_ses_* | _Incapsula_Resource |
| akamai | x-akamai-edgescape, akamai-prefixed mgr headers | _abck, bm_sz | (header-detected) |
| recaptcha (opt-in) | — | — | class="g-recaptcha", https://www.google.com/recaptcha/ |
| hcaptcha (opt-in) | — | — | class="h-captcha", https://js.hcaptcha.com/ |
recaptcha and hcaptcha are off by default because they appear on legitimate forms too (login, signup, comment boxes). Opt in per site when the site is known to gate behind one of them.
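The default-on versus opt-in behaviour can be sketched as a signature table plus a per-service switch. This is a minimal illustration, not the framework's detector: SnapshotSketch, SIGNATURES, and detectService are hypothetical names, only a few signals from the table are included, and two behaviours are assumptions (any single signal counts as a match, and services not named in the config keep their default).

```typescript
// Hypothetical snapshot shape for this sketch; the real ResponseSnapshot type may differ.
interface SnapshotSketch {
  headers: Record<string, string>; // lower-cased header names
  cookieNames: string[];
  body: string;
}

type Service = "cloudflare" | "datadome" | "recaptcha";

interface Signature {
  service: Service;
  optIn: boolean; // true = off unless the site enables it
  matches(r: SnapshotSketch): boolean;
}

// A subset of the signals in the table; OR-ing them is an assumption of this sketch.
const SIGNATURES: Signature[] = [
  {
    service: "cloudflare",
    optIn: false,
    matches: (r) =>
      r.headers["cf-mitigated"] === "challenge" ||
      r.body.includes("/cdn-cgi/challenge-platform/"),
  },
  {
    service: "datadome",
    optIn: false,
    matches: (r) => "x-datadome-cid" in r.headers || r.cookieNames.includes("datadome"),
  },
  {
    service: "recaptcha",
    optIn: true,
    matches: (r) => r.body.includes('class="g-recaptcha"'),
  },
];

type DetectConfig = false | Partial<Record<Service, boolean>>;

function detectService(r: SnapshotSketch, config: DetectConfig = {}): Service | null {
  if (config === false) return null; // detection disabled entirely
  for (const sig of SIGNATURES) {
    // Assumption: a service not named in the config keeps its default.
    const enabled = config[sig.service] ?? !sig.optIn;
    if (enabled && sig.matches(r)) return sig.service;
  }
  return null;
}

// A legitimate login form that embeds a reCAPTCHA widget.
const loginPage: SnapshotSketch = {
  headers: {},
  cookieNames: [],
  body: '<form><div class="g-recaptcha"></div></form>',
};
const byDefault = detectService(loginPage); // recaptcha is opt-in, so nothing is enabled for it
const whenOptedIn = detectService(loginPage, { recaptcha: true });
```

With the defaults, the login form passes through untouched; only a site that opts in treats the widget as a block.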
Configuration
// Default: all five anti-bot services on; the two interactive captchas off.
defineSite({ site, origins, ... }) // (no `detectCaptcha` field → defaults apply)
// Opt in/out per service:
defineSite({
site,
origins,
detectCaptcha: {
cloudflare: true,
datadome: true,
recaptcha: true, // opt in
},
})
// Disable entirely (author handles everything in checkResponse):
defineSite({ site, origins, detectCaptcha: false })
Overriding a match
The author's checkResponse runs only when detection didn't throw. If you want to suppress a specific match — for example, the site shows a Cloudflare interstitial to all visitors and a real article still loads — set detectCaptcha: { cloudflare: false } and write the suppression logic in checkResponse yourself.
What it doesn't catch
- Site-specific anti-bot pages with no known service marker: these still need a hand-written checkResponse rule.
- Soft blocks that return a 200 with a stub body: schema validation usually catches those.
- IP-level rate limits served by CDNs without a captcha challenge: those surface as HTTP 429 and the framework's RateLimitedError mapping.
Detection patterns live in @sitely/framework and update via patch releases; consumers get fixes for free on pnpm update. The patterns aren't part of any site package's manifest, so changes don't trigger semver-discipline.
.use(segment)
Composes a segment defined in another file. The segment's accumulated TResources and TPages extend the site's; cross-references in the segment must be satisfied by what's already accumulated when .use() is called.
// pages/article.ts
import { defineSegment, urlPattern, TTL } from "@sitely/framework";
import { Article } from "@sitely/schemas";
const articleUrl = urlPattern("https://blog.example.com/article/:id");
export const articleSegment = defineSegment()
.resource("article", { schema: Article, url: articleUrl, ttl: TTL.daily })
.page(articleUrl, { validate: ..., extract: ..., fixtures: [...] });
// src/index.ts
export default defineSite({...})
.use(articleSegment)
.use(commentsSegment)
.build();
Strict ordering: if commentsSegment references "article" via derivedFrom, it must be .use()d after articleSegment. Composing out of order is a compile error at the .use() boundary.
.build()
Terminal step. Returns the final SiteDefinition (no longer a builder). The server, the test harness, and sitely build all consume the result of .build().
URLPattern — the bidirectional URL primitive
export function urlPattern<TPattern extends string>(
pattern: TPattern,
paramsSchema?: Record<ExtractParams<TPattern>, StandardSchemaV1>,
): URLPattern<ExtractParams<TPattern>>;
export interface URLPattern<TParams extends Record<string, string>> {
readonly pattern: string;
toUrl(params: TParams): string;
parseUrl(url: string): TParams | null;
}
ExtractParams<TPattern> is a TypeScript type-level operation that pulls :segment placeholders out of the literal pattern string. urlPattern("/article/:id") returns URLPattern<{ id: string }>.
The optional paramsSchema adds runtime validation when the URL is built or parsed. A schema mismatch on toUrl or parseUrl throws ParamValidationError.
The underlying implementation is the web standard URL Pattern API where available, with a small polyfill for older runtimes.
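A minimal sketch of how the type-level extraction and the two directions (build and parse) could fit together. It assumes path-only patterns and uses a hand-rolled regular expression instead of the URL Pattern API the framework wraps; sketchUrlPattern and its helper types are hypothetical names, and paramsSchema validation is left out.

```typescript
// Pull ":name" placeholders out of a literal pattern string at the type level.
type ExtractParams<P extends string> =
  P extends `${string}:${infer Name}/${infer Rest}`
    ? Name | ExtractParams<`/${Rest}`>
    : P extends `${string}:${infer Name}`
      ? Name
      : never;

type Params<P extends string> = Record<ExtractParams<P>, string>;

interface SketchPattern<P extends string> {
  readonly pattern: P;
  toUrl(params: Params<P>): string;
  parseUrl(url: string): Params<P> | null;
}

const PLACEHOLDER = /:([A-Za-z_]\w*)/g;

function sketchUrlPattern<P extends string>(pattern: P): SketchPattern<P> {
  const names: string[] = [];
  // Escape regex metacharacters in the literal parts, then turn each :name into a capture group.
  const source = pattern
    .replace(/[.*+?^${}()|[\]\\]/g, "\\$&")
    .replace(PLACEHOLDER, (_match, name: string) => {
      names.push(name);
      return "([^/]+)";
    });
  const matcher = new RegExp(`^${source}$`);

  const impl = {
    pattern,
    toUrl(params: Record<string, string>): string {
      return pattern.replace(PLACEHOLDER, (_match, name: string) =>
        encodeURIComponent(params[name] ?? ""),
      );
    },
    parseUrl(url: string): Record<string, string> | null {
      const match = matcher.exec(url);
      if (!match) return null;
      const out: Record<string, string> = {};
      names.forEach((name, i) => {
        out[name] = decodeURIComponent(match[i + 1] ?? "");
      });
      return out;
    },
  };
  // The runtime works on plain string maps; the cast attaches the inferred param type.
  return impl as unknown as SketchPattern<P>;
}

const article = sketchUrlPattern("/article/:id");
const built = article.toUrl({ id: "hello world" }); // `id` is required by the type
const parsed = article.parseUrl("/article/hello-world");
const miss = article.parseUrl("/other/1");
```

The same object both builds and parses, which is what makes the pattern usable as a page identifier and as a fetch target.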
ExtractContext — what callbacks see
export interface ExtractContext<TParams extends Record<string, string>> {
$(selector: string): PageElement | null;
$$(selector: string): PageElement[];
jsonLd(type?: string): Record<string, unknown>[];
params: TParams;
url: string;
canonical: string | null;
status: number;
headers: Record<string, string>;
locale: string | null;
fetch(url: string, opts?: RequestInit): Promise<Response>;
lazy<T>(fn: () => T | Promise<T>): () => T | Promise<T>;
}
The context is generic over the page's URL pattern, so ctx.params is typed as TParams ({ id: string } for /article/:id).
- $, $$: come from the active PageDriver (see @sitely/page). Authors never construct a driver.
- jsonLd(type?): pulls every <script type="application/ld+json"> block on the page, optionally filtered by @type.
- fetch: outbound HTTP. Subject to the site's rate limit and circuit breaker.
- lazy: see ctx.lazy.
Field functions
Every leaf in an extract return is a function. Even constants:
extract: async (ctx) => ({
article: {
"@type": () => "Article" as const, // constant — still a function
headline: () => ctx.$("h1").text(),
body: () => ctx.$("article .content").text(),
publishedAt: async () => new Date(ctx.$("time").attr("datetime") ?? ""),
},
}),
Per-field execution: the framework calls each function in turn, awaiting any that return Promises. Throws are caught, recorded as per-field errors with the resolved field path (article.publishedAt), and the field is treated as absent. Schema validation then decides whether absence is permitted (it is, for presence-annotated optional fields).
Field functions also enable resource filter projection: if a request includes ?resources=article and the page provides multiple resources, only the requested resource's field functions are invoked.
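The per-field execution and the resource filter described above can be sketched as one loop. This is an illustration under stated assumptions, not the framework's runner: runFieldFunctions and its types are hypothetical, the requested set stands in for whatever the server parses out of ?resources=, and schema validation is omitted.

```typescript
type FieldFn = () => unknown;
type ExtractResult = Record<string, Record<string, FieldFn>>;

interface FieldError {
  path: string; // e.g. "article.publishedAt"
  error: unknown;
}

interface RunResult {
  data: Record<string, Record<string, unknown>>;
  errors: FieldError[];
}

async function runFieldFunctions(
  extracted: ExtractResult,
  requested?: ReadonlySet<string>, // hypothetical: the resource filter, if any
): Promise<RunResult> {
  const data: Record<string, Record<string, unknown>> = {};
  const errors: FieldError[] = [];
  for (const [resource, fields] of Object.entries(extracted)) {
    // Filter projection: field functions of unrequested resources are never invoked.
    if (requested && !requested.has(resource)) continue;
    const values: Record<string, unknown> = {};
    for (const [field, fn] of Object.entries(fields)) {
      try {
        values[field] = await fn(); // await covers sync values and Promises alike
      } catch (error) {
        // The field is left absent; the error keeps its resolved path.
        errors.push({ path: `${resource}.${field}`, error });
      }
    }
    data[resource] = values;
  }
  return { data, errors };
}

let commentFieldCalls = 0;
const sample: ExtractResult = {
  article: {
    headline: () => "Hello",
    publishedAt: async () => {
      throw new Error("no <time> element");
    },
  },
  comments: {
    items: () => {
      commentFieldCalls += 1;
      return [];
    },
  },
};
```

Calling runFieldFunctions(sample, new Set(["article"])) leaves the comments field function untouched and reports one error under article.publishedAt, while headline still comes through.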
Shared computation: ctx.lazy
When multiple fields need the same expensive computation (e.g. parse JSON-LD once, read it from several fields), wrap it in ctx.lazy:
extract: async (ctx) => {
  // Create the memoised producer inside extract, where ctx is in scope.
  const jsonLd = ctx.lazy(() => ctx.jsonLd("Article")[0]);
  return {
    article: {
      // Optional chaining lets the h1 fallback run when the page has no Article JSON-LD.
      headline: () => jsonLd()?.headline ?? ctx.$("h1").text(),
      author: () => jsonLd()?.author?.name,
      date: () => jsonLd()?.datePublished,
    },
  };
},
The producer runs at most once. Subsequent calls return the memoised value. Errors are captured and re-thrown: if the producer throws on first call, every dependent field sees the same error instance, so telemetry attributes the failure to the upstream cause rather than to each consumer field.
Async variant: ctx.lazy(async () => ...) returns () => Promise<T>, awaited per call but resolved once.
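A minimal sketch of the memoisation contract described above, assuming nothing about the framework's internals; the standalone lazy function here is a stand-in for ctx.lazy. The producer runs once, and a thrown error is stored and re-thrown as the same instance on every later call.

```typescript
function lazy<T>(producer: () => T): () => T {
  let settled = false;
  let value: T | undefined;
  let failure: { error: unknown } | undefined;

  return () => {
    if (!settled) {
      settled = true; // set first, so a throwing producer is never retried
      try {
        value = producer();
      } catch (error) {
        failure = { error };
      }
    }
    if (failure) throw failure.error; // same error instance for every caller
    return value as T;
  };
}

let runs = 0;
const answer = lazy(() => {
  runs += 1;
  return 42;
});
answer();
answer(); // served from the memoised value; the producer is not called again
```

Because the stored value can itself be a Promise, lazy(async () => ...) gives the async behaviour for free: each caller awaits the same Promise, which resolves or rejects once.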
The framework error hierarchy
Authors throw typed errors from checkResponse or extract to signal failure modes. The framework catches them and maps to consumer-facing status. Two families plus an internal error:
export class FrameworkError extends Error {}
// Response errors — about the response itself being bad
export class ResponseError extends FrameworkError {}
export class RateLimitedError extends ResponseError { constructor(opts?: { retryAfter?: number }); }
export class BlockedError extends ResponseError { constructor(opts?: { retryAfter?: number }); }
export class CaptchaError extends BlockedError {
constructor(opts?: {
service?: "cloudflare" | "datadome" | "perimeterx" | "incapsula" | "akamai" | "recaptcha" | "hcaptcha";
retryAfter?: number;
});
}
export class TransientError extends ResponseError { constructor(opts?: { retryAfter?: number }); }
export class PermanentError extends ResponseError { constructor(opts: { reason: string }); }
export class BadResponseError extends ResponseError { constructor(opts: { reason: string }); }
// Extraction errors — about extracting from a good response
export class ExtractionError extends FrameworkError {}
export class MissingDataError extends ExtractionError { constructor(opts: { field: string; reason: string }); }
export class MalformedDataError extends ExtractionError { constructor(opts: { field: string; reason: string }); }
// Internal — framework throws after extract when schema validation fails
export class ValidationError extends FrameworkError {}
See the framework errors glossary entry for what each maps to as a consumer-facing status and when authors should throw which. The per-host circuit breaker counts only ResponseError toward opening; extraction errors are author bugs, not site outages.
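The split between the two families is what lets a breaker ignore author bugs. The sketch below uses local stand-in classes so it is self-contained; BreakerSketch, its threshold, and the absence of any reset policy are assumptions, not the framework's circuit breaker. It also assumes a compile target of ES2015 or later, where instanceof works for Error subclasses.

```typescript
// Stand-ins for part of the hierarchy above.
class FrameworkError extends Error {}
class ResponseError extends FrameworkError {}
class RateLimitedError extends ResponseError {}
class ExtractionError extends FrameworkError {}
class MissingDataError extends ExtractionError {}

// Hypothetical per-host failure counter.
class BreakerSketch {
  private failures = 0;
  private readonly threshold: number;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  record(error: unknown): void {
    // Only response-level failures say something about the site's health.
    if (error instanceof ResponseError) this.failures += 1;
  }

  get open(): boolean {
    return this.failures >= this.threshold;
  }
}

const breaker = new BreakerSketch(2);
breaker.record(new MissingDataError("no headline")); // extraction error: not counted
breaker.record(new RateLimitedError("429"));
const openAfterOne = breaker.open;
breaker.record(new RateLimitedError("429"));
const openAfterTwo = breaker.open;
```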
Why a builder, not a literal object
Cross-references between resources, pages, and segments need types to flow as each step is added:
- derivedFrom: "article" must be typed against the accumulated resource keys.
- extract's return must be a subset of registered resources.
- .use(segment) must satisfy the segment's requires type.
A literal object can't type-check a key against another key in the same object — the inference happens in one pass. A builder accumulates types step-by-step, so each .resource() extends the type tracked by .page() and .use() that follow.
The cost is a .build() at the end and slightly more verbose composition. The win is full type safety on cross-references — every name reference is a compile error if it doesn't exist.
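The accumulation idea can be shown with a much smaller builder. MiniBuilder is a hypothetical stand-in that tracks only resource names, not schemas or pages; it is meant to show how each call widens the type the next call is checked against, not how SiteBuilder is written.

```typescript
class MiniBuilder<TResources = {}> {
  private readonly defs: Record<string, { derivedFrom?: string }>;

  constructor(defs: Record<string, { derivedFrom?: string }> = {}) {
    this.defs = defs;
  }

  // Each call returns a new builder whose type also knows about `name`,
  // so a later derivedFrom can only mention names registered before it.
  resource<N extends string>(
    name: N,
    def: { derivedFrom?: keyof TResources & string },
  ): MiniBuilder<TResources & Record<N, true>> {
    return new MiniBuilder<TResources & Record<N, true>>({ ...this.defs, [name]: def });
  }

  build(): { resources: string[] } {
    return { resources: Object.keys(this.defs) };
  }
}

const site = new MiniBuilder()
  .resource("article", {})
  .resource("comments", { derivedFrom: "article" }) // OK: "article" is already accumulated
  .build();

// new MiniBuilder().resource("comments", { derivedFrom: "article" });
// would not compile: nothing has been accumulated yet, so derivedFrom accepts no names.
```

The same mechanism explains the ordering rule for .use(): a reference is checked against whatever type has been accumulated at the point of the call.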
Module map
One subsection per top-level file. Subsystems with their own page (./build, ./test-pkg) get a single sentence; deeper coverage is one link away.
./index.ts — the public surface
Re-exports everything authors import:
- DSL entry: defineSite, defineSegment.
- URL primitive: urlPattern, URLPattern type.
- Schema helpers: presence, asset.
- TTL presets: TTL.realtime, TTL.short, TTL.medium, TTL.daily, TTL.weekly.
- Framework errors: the full hierarchy above.
- Runtime contract types: SiteDefinition, ExtractContext, PageDef, ResourceDef, RateLimitConfig, FrameworkRange, FamilyConfig, LocaleConfig, Origin, SiteIdentity, AssetRef, ResourceTTL, CrawlConfig, PaginateDef, ResponseSnapshot.
- JSON-LD helpers: extractJsonLd, filterJsonLdByType.
- robots.txt parsing: parseRobotsTxt, RobotsChecker.
- OpenAPI emission: generateOpenApiSpec.
- Build entry: buildPackage, validatePackage, snapshotUrl, stableSerialize, parseTTL, plus the matching types.
- Test-pkg entry: testPackage and its result types.
./types.ts — the contract
Types only — no runtime code. SiteDefinition is the final shape defineSite(...).build() produces; the intermediate SiteBuilder<TConfig, TResources, TPages> types live alongside.
Edge case: adding a required field to SiteDefinition is a major-version bump for the framework. Optional fields with defaults are the way to evolve the surface.
./builder.ts — the builder implementation
defineSite() and defineSegment() return builder objects whose methods (.resource, .page, .use, .checkResponse, .build) accumulate state into a new builder. Each method's TypeScript signature carries the next accumulation step (e.g. .resource(name, def) returns SiteBuilder<TConfig, TResources & { [name]: def }, TPages>).
The runtime implementation is straightforward — a plain object holding a mutable map of resources and pages — but the generics are where the type safety lives.
./url-pattern.ts — URLPattern factory
export function urlPattern<TPattern extends string>(
pattern: TPattern,
paramsSchema?: Record<ExtractParams<TPattern>, StandardSchemaV1>,
): URLPattern<ExtractParams<TPattern>>;
Wraps the web standard URLPattern constructor with the inferred-params type. The ExtractParams<TPattern> operation is a TypeScript template-literal type that finds :segment placeholders.
Edge case: invalid pattern. Throws synchronously at the call site. urlPattern("/article") with no : is valid (empty params).
./schema-helpers.ts — presence, asset
export function presence<S extends StandardSchemaV1>(schema: S, rate: number): S;
export function asset(type: "image" | "video" | "audio" | "document"): StandardSchemaV1;
presence(schema, rate) is a Standard Schema wrapper that decorates the schema with a presence-rate annotation. The framework reads the annotation at build time (where it is emitted as a JSON Schema annotation) and at runtime (where telemetry compares the observed rate against the declared one).
asset(type) returns a Standard Schema validator accepting a URL string with the asset-type brand attached. Runtime introspects via the brand for "discover all assets on this resource" tooling.
./errors.ts — the framework error classes
Defines the hierarchy above. Each class carries a stable kind discriminator string so consumers can match without instanceof.
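One reason a string discriminator helps: an error that has crossed a process or JSON boundary loses its prototype, so instanceof stops working while kind still matches. The sketch below is illustrative only; SketchError, hasKind, and the "rate_limited" value are hypothetical, and the real discriminator strings are not listed on this page.

```typescript
class SketchError extends Error {
  readonly kind: string;

  constructor(kind: string, message: string) {
    super(message);
    this.kind = kind;
  }
}

class RateLimitedSketch extends SketchError {
  constructor() {
    super("rate_limited", "rate limited"); // hypothetical kind value
  }
}

// Works on live instances and on plain objects that came over the wire.
function hasKind(value: unknown, kind: string): boolean {
  return (
    typeof value === "object" &&
    value !== null &&
    (value as { kind?: unknown }).kind === kind
  );
}

const original = new RateLimitedSketch();
// Simulate serialisation: the result is a plain object with no prototype link.
const overTheWire: unknown = JSON.parse(
  JSON.stringify({ kind: original.kind, message: original.message }),
);
const stillAnInstance = overTheWire instanceof RateLimitedSketch;
const stillMatchesKind = hasKind(overTheWire, "rate_limited");
```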
./ttl.ts — TTL presets + parsing
export const TTL: {
realtime: ResourceTTL; // { default: "30s", min: "10s", max: "5m" }
short: ResourceTTL; // { default: "5m", min: "1m", max: "1h" }
medium: ResourceTTL; // { default: "1h", min: "10m", max: "6h" }
daily: ResourceTTL; // { default: "24h", min: "1h", max: "7d" }
weekly: ResourceTTL; // { default: "7d", min: "1d", max: "30d" }
};
export function parseTTL(s: string): number | null;
Authors use the presets in resource definitions: ttl: TTL.daily. Custom values are still allowed: ttl: { default: "2h", min: "30m", max: "12h" }.
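A minimal sketch of a duration parser with the same signature. Two things here are assumptions rather than documented behaviour: that the accepted suffixes are exactly the s, m, h, and d seen in the presets, and that the returned number is milliseconds. parseTtlSketch is a hypothetical name.

```typescript
// Assumed unit table, based on the suffixes used in the presets.
const UNIT_MS: Record<string, number> = {
  s: 1_000,
  m: 60_000,
  h: 3_600_000,
  d: 86_400_000,
};

function parseTtlSketch(s: string): number | null {
  const match = /^(\d+)([smhd])$/.exec(s.trim());
  if (!match) return null; // unknown shape: report failure instead of guessing
  const amount = Number(match[1]);
  const unit = UNIT_MS[match[2] ?? ""];
  return unit === undefined ? null : amount * unit;
}

const thirtySeconds = parseTtlSketch("30s");
const sevenDays = parseTtlSketch("7d");
const invalid = parseTtlSketch("soon");
```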
./context.ts — ExtractContext factory
Builds the context object every page callback receives. The default fetch throws — only the server installs a real one. JSON-LD parsing is deferred and memoised across ctx.jsonLd() calls on the same context.
./origins.ts — locale → hostname resolution
export function getActiveOrigins(site: SiteDefinition, locale?: string): Array<{ hostname: string; locale: string | null }>;
export function getPrimaryHostname(site: SiteDefinition, locale?: string): string;
export function getAllHostnames(site: SiteDefinition): string[];
Pure functions. Resolves locale-templated origins (e.g. {locale}.wikipedia.org) against the locales.values set. The server's site-loader uses getAllHostnames() to build its hostname → site dispatch table.
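The template expansion can be sketched as plain string substitution. This assumes that {locale} is replaced verbatim and that non-templated origins carry a null locale; expandOrigins and its types are hypothetical and much narrower than the real Origin and LocaleConfig shapes.

```typescript
interface OriginSketch {
  hostname: string; // may contain the literal "{locale}" placeholder
}

interface ActiveOrigin {
  hostname: string;
  locale: string | null;
}

function expandOrigins(origins: OriginSketch[], locales: string[]): ActiveOrigin[] {
  const out: ActiveOrigin[] = [];
  for (const origin of origins) {
    if (!origin.hostname.includes("{locale}")) {
      out.push({ hostname: origin.hostname, locale: null });
      continue;
    }
    // One concrete hostname per declared locale.
    for (const locale of locales) {
      out.push({ hostname: origin.hostname.replace("{locale}", locale), locale });
    }
  }
  return out;
}

const active = expandOrigins(
  [{ hostname: "{locale}.wikipedia.org" }, { hostname: "static.example.org" }],
  ["en", "de"],
);
const hostnames = active.map((o) => o.hostname);
```

A dispatch table like the one the site-loader builds would then be a map from each of these hostnames back to the owning site.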
./robots.ts — robots.txt parsing
export interface RobotsChecker {
isAllowed(url: string, userAgent?: string): boolean;
}
export function parseRobotsTxt(robotsTxtUrl: string, content: string): RobotsChecker;
A thin wrapper around robots-parser. The server's robots-service calls this once per origin per TTL, caches the resulting RobotsChecker, and consults it before any outbound request.
Edge cases:
- Malformed robots.txt — treated as "every URL is allowed" (matches the standard).
- No robots.txt at all — same. Absence means permitted.
- User-agent matching is case-insensitive.
./json-ld.ts — JSON-LD helpers
export function extractJsonLd(driver: PageDriver): Record<string, unknown>[];
export function filterJsonLdByType(items: Record<string, unknown>[], type: string): Record<string, unknown>[];
Used internally by ctx.jsonLd(type?); re-exported for site code that wants direct access. Flattens @graph containers and skips unparseable JSON islands.
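A sketch of the flattening and filtering steps, working on raw script contents rather than a PageDriver so it stays self-contained. parseJsonLdBlocks and filterByType are hypothetical names, and treating an array-valued @type as a match when it contains the requested type is an assumption.

```typescript
type JsonLdNode = Record<string, unknown>;

function isNode(value: unknown): value is JsonLdNode {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// rawBlocks: the text content of each <script type="application/ld+json"> element.
function parseJsonLdBlocks(rawBlocks: string[]): JsonLdNode[] {
  const nodes: JsonLdNode[] = [];
  for (const raw of rawBlocks) {
    let parsed: unknown;
    try {
      parsed = JSON.parse(raw);
    } catch {
      continue; // unparseable island: skip it and keep the rest
    }
    const roots: unknown[] = Array.isArray(parsed) ? parsed : [parsed];
    for (const root of roots) {
      if (!isNode(root)) continue;
      const graph = root["@graph"];
      if (Array.isArray(graph)) {
        nodes.push(...graph.filter(isNode)); // flatten the @graph container
      } else {
        nodes.push(root);
      }
    }
  }
  return nodes;
}

function filterByType(items: JsonLdNode[], type: string): JsonLdNode[] {
  return items.filter((item) => {
    const declared = item["@type"];
    return Array.isArray(declared) ? declared.includes(type) : declared === type;
  });
}

const nodes = parseJsonLdBlocks([
  '{"@graph":[{"@type":"Article","headline":"Hi"},{"@type":"Person"}]}',
  "{not json",
  '{"@type":["Article","NewsArticle"]}',
]);
const articles = filterByType(nodes, "Article");
```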
./cli.ts — the sitely binary
#!/usr/bin/env node
Dispatch entrypoint behind the sitely command. Delegates to ./build, ./test-pkg, and ./openapi.
./openapi.ts — OpenAPI 3.1 emitter
export function generateOpenApiSpec(sites: SiteDefinition[]): Record<string, unknown>;
Each resource with a url becomes a /v1/sites/{host}/{resource} route; derivedFrom resources are included with a parent-link annotation. The generic endpoints (/v1/extract, /v1/sites, /v1/schemas, /healthz) are added uniformly.
./testing.ts — vitest-dependent entrypoint
Authors import from @sitely/framework/testing in their *.test.ts files. Provides vitest-bound helpers (describePageExtraction, createFixtureLoader) on top of the vitest-free harness.
The build subsystem (brief)
./build/ is the path from authored SiteDefinition to dist/index.js + dist/manifest.json + dist/schemas/*.json + dist/baseline-manifest.json. Exposes buildPackage() as the single entrypoint. Bundling, version injection, schema emission, manifest assembly, deterministic serialization.
Deep dive: @sitely/framework — build subsystem.
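Deterministic serialization is the step that makes two builds of the same definition byte-identical. One common way to get there is to sort object keys before writing; the sketch below shows that technique and is not a description of how stableSerialize is implemented. stableStringify is a hypothetical name.

```typescript
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    // Array order is meaningful, so it is preserved.
    return `[${value.map((item) => stableStringify(item)).join(",")}]`;
  }
  if (typeof value === "object" && value !== null) {
    const obj = value as Record<string, unknown>;
    const members = Object.keys(obj)
      .sort() // key order no longer depends on how the object was assembled
      .filter((key) => obj[key] !== undefined)
      .map((key) => `${JSON.stringify(key)}:${stableStringify(obj[key])}`);
    return `{${members.join(",")}}`;
  }
  // Primitives; undefined inside arrays becomes null, as in JSON.stringify.
  return JSON.stringify(value) ?? "null";
}

const a = stableStringify({ b: 1, a: [2, { d: 1, c: 2 }] });
const b = stableStringify({ a: [2, { c: 2, d: 1 }], b: 1 });
```

Both calls produce the same string even though the input objects were assembled in different key orders, which is the property a manifest diff relies on.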
The test-pkg subsystem (brief)
./test-pkg/ is the in-process test runner plus the eight checks. Discovers fixtures by walking each page's inline declarations, runs validate + extract per fixture, and aggregates check results.
Deep dive: @sitely/framework — test-pkg subsystem.
The CLI
| Command | Notable flags | What it does |
|---|---|---|
| sitely build | --dry-run, --publish | Compiles src/index.ts via esbuild with version injected, emits dist/index.js + manifest + schemas. --publish rotates dist/baseline-manifest.json to the freshly-built manifest. |
| sitely test | --only <name>, --skip <name>, --watch, --update <fixture>, --diff <fixture>, --strict, --verbose | Runs the eight checks. Per-flag detail in The test suite. |
| sitely check | --live | Static validation: site definition shape, schema references, URLPattern parse. --live fetches each page's example URL and runs full extraction; opt-in. |
| sitely snapshot <params> | --page <key>, --overwrite | Captures HTML + meta for a fixture entry. Params are typed against the page's URLPattern. |
| sitely fetch-fixtures | | Re-fetches every fixture's URL, updating the on-disk HTML. |
| sitely openapi | --out <file> | Runs generateOpenApiSpec() and writes the resulting JSON. |
| sitely init | | Scaffolds a new site package. |
| sitely dev | | Watch mode: re-runs validate+extract against fixtures as the author edits, with diffs. |
| sitely try <url> | | One-shot live extraction against a real URL. |
| sitely diff <ref> | | Compares extracted output against a baseline. |
| sitely doctor | | Local health check: fixture freshness, schema validity, fixture-coverage gaps. |
| sitely list <subject> | | Lists @sitely/schemas exports, declared TTL presets, or framework errors. |
| sitely migrate | | Applies framework-version migrations. |
Every command supports --help.
Read next
- The build manifest: the single shared artifact and the field-by-field contract.
- Build subsystem: how the builder output becomes dist/index.js + dist/manifest.json.
- Test-pkg subsystem: the in-process runner and the eight checks.
- Site packages: what an index.ts looks like in the wild.
- @sitely/page: the DOM abstraction every extract function talks to.
- Glossary: terminology used across these pages.