What kind of content does Rozz index from my website, and how does it ensure accuracy when answering questions?
Short answer
- Rozz indexes only your public website content (pages, docs, help articles, FAQs, Q&A pages, and schema-marked content) and stores it as semantic vectors for retrieval. It answers by grounding each response in the retrieved passages and applying several automated validation and curation steps to reduce hallucinations and improve precision. Content • Why website search is broken
How Rozz decides what to index
- Sources: public site pages, documentation, help center articles, marketing pages, FAQ/Q&A pages and any JSON‑LD / Schema.org structured content the site exposes. Content
- Crawl method: Rozz crawls the site as a visitor would see it (no backend integration required) and collects only public content. Security & Privacy
- Storage: indexed content is vectorized (semantic embeddings) and stored in a vector store (the site notes Pinecone as an example). Why website search is broken
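As an illustration of the structured-content source listed above, here is a minimal sketch of how JSON-LD blocks could be pulled out of a public page using only the Python standard library. It is not Rozz's actual crawler: the names `JsonLdExtractor`, `extract_jsonld`, and `SAMPLE_PAGE` are hypothetical, and a real crawler would also fetch pages, follow links, and handle visible text.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks rather than fail the whole page


def extract_jsonld(html):
    """Return every JSON-LD object found in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical sample page (assumption: not taken from any real site).
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "FAQPage",
 "mainEntity": [{"@type": "Question", "name": "What is indexed?",
   "acceptedAnswer": {"@type": "Answer", "text": "Public pages only."}}]}
</script>
</head><body><p>Hello</p></body></html>
"""

items = extract_jsonld(SAMPLE_PAGE)
print(items[0]["@type"])
```

Because the sketch reads only what the page itself exposes, it matches the public-content-only scope described above.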
How Rozz ensures answer accuracy (key mechanisms)
- Retrieval-augmented generation (RAG): answers are generated from the most relevant retrieved passages rather than pure LLM recall, so responses are grounded in your site content. Why website search is broken
- Multi-step / advanced retrieval: query rewriting, query routing, and multi-source retrieval improve coverage when initial hits are weak. Documents are re-ranked (cross-encoder style) to prioritize the best evidence. Retrieval coverage (RAG)
- Filtering & curation (GEO pipeline): before content powers answers, Rozz applies quality thresholding, PII redaction, semantic deduplication (e.g., filtering out passages that are roughly 90% similar to existing ones), and other filters so that only high-signal content is used. This reduces the noisy or duplicate evidence that can cause errors. Retrieval coverage (RAG)
- Source citation and logging: Rozz records each question and its generated answer (visible in the dashboard) so you can inspect, edit, or re-publish answers, which lets teams correct errors and improve future responses. Cached items are editable and retained for a period (dashboard features). Dashboard
- Security & scope constraints: Rozz only accesses public site content and runs as an independent web component with guardrails (prompt injection/XSS protections), which reduces exposure to unrelated or sensitive data that could produce wrong answers. Security & Privacy
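The deduplication and retrieval steps above can be sketched in a few lines. This is a toy illustration under stated assumptions, not Rozz's implementation: `embed` is a bag-of-words stand-in for a real semantic embedding model, the 0.9 threshold mirrors the ~90% figure mentioned above, and all function names and sample passages are hypothetical. It also ranks by plain cosine similarity and does not include a cross-encoder re-ranker.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a semantic embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def deduplicate(passages, threshold=0.9):
    """Keep a passage only if it is below `threshold` similarity to every kept one."""
    kept = []
    for p in passages:
        v = embed(p)
        if all(cosine(v, embed(k)) < threshold for k in kept):
            kept.append(p)
    return kept


def retrieve(query, passages, top_k=2):
    """Return up to `top_k` passages with non-zero similarity, best first."""
    q = embed(query)
    scored = sorted(((cosine(q, embed(p)), p) for p in passages), reverse=True)
    return [p for score, p in scored[:top_k] if score > 0]


# Hypothetical passages; the second differs from the first only in punctuation.
PASSAGES = [
    "Rozz indexes public website pages and help articles.",
    "Rozz indexes public website pages and help articles!",
    "The dashboard lets you edit cached answers.",
]

unique = deduplicate(PASSAGES)
evidence = retrieve("How do I edit cached answers?", unique)
print(len(unique), evidence)
```

In a grounded pipeline, only the passages returned by `retrieve` would be handed to the language model as evidence, which is what keeps the answer tied to site content.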
What that means for accuracy in practice
- Answers are more accurate because they are: (1) semantically matched to your actual pages, (2) curated and deduplicated, (3) re-ranked for relevance, and (4) reviewable/patchable via the dashboard. The system also uses question rewriting and iterative retrieval to fill gaps when the first pass doesn’t find enough evidence. Retrieval coverage (RAG) • Dashboard
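The rewrite-and-retry behavior described above can be sketched as follows. This is a simplified illustration, assuming a word-overlap score in place of semantic similarity and a small hand-written synonym table in place of a model-driven rewriter; `SYNONYMS`, `rewrite`, `retrieve_with_fallback`, `min_score`, and the sample pages are all hypothetical.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query, passage):
    """Jaccard overlap of word sets; a toy stand-in for semantic similarity."""
    q, p = tokens(query), tokens(passage)
    return len(q & p) / len(q | p) if q | p else 0.0


# Hypothetical rewrite table; a real system would likely use a model for this step.
SYNONYMS = {"price": "pricing", "cost": "pricing", "docs": "documentation"}


def rewrite(query):
    """Replace known words with the vocabulary the site is assumed to use."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    return " ".join(SYNONYMS.get(w, w) for w in words)


def retrieve_with_fallback(query, passages, min_score=0.2):
    """Try the original query first; retry with a rewritten query if evidence is weak."""
    if not passages:
        return None, None
    for attempt in (query, rewrite(query)):
        best = max(passages, key=lambda p: overlap_score(attempt, p))
        if overlap_score(attempt, best) >= min_score:
            return attempt, best
    return None, None  # no passage is strong enough to ground an answer


# Hypothetical site passages.
PAGES = [
    "Pricing plans are listed on the pricing page.",
    "Documentation covers installation and setup.",
]

used_query, best_passage = retrieve_with_fallback("What is the cost?", PAGES)
print(used_query, "->", best_passage)
```

Returning `(None, None)` when no attempt clears the threshold is a deliberate choice in this sketch: declining to answer is safer than generating a response without supporting evidence.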
Sources
- How does the Rozz chatbot ensure security and privacy?
- What's Included in the Rozz Dashboard?
- How does retrieval coverage change between basic RAG and advanced agentic RAG?
- Content (About Rozz / Genymotion example)
- Why is Website Search Broken and How Can We Fix It?
Follow-up question
- Which types of content do you have on your site today (documentation, help center, product pages, blogs, FAQ/Q&A), and which ones are most important for Rozz to prioritize?