How Do Search Engines Work? Step-by-Step Explained
Step 1: Crawling (What Is Crawling and Indexing?)
Crawling is how a search engine finds pages that exist on the web. It’s the discovery phase, nothing more. Google’s main crawler is called Googlebot, and it works a lot like a spider moving across a web of connected pages, following link after link to see where each one leads. That’s where the nickname “spiders” comes from; “bots” is simply short for robots.
A crawler discovers new URLs in three main ways:
- Following links. If a page Googlebot already knows about links to a new page, the bot follows that link and adds the new page to its list of pages to visit.
- Reading XML sitemaps. A sitemap is a file a website owner publishes or submits that lists the URLs they want crawled, basically a map handed directly to the crawler so important pages don’t get missed.
- Direct submission. Site owners can request a crawl through Google Search Console, the free tool Google provides for monitoring how your site shows up in search.
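To make the discovery loop concrete, here is a minimal sketch of a crawl frontier in Python. It is a toy, not how Googlebot is implemented: the `LINKS` dictionary stands in for the live web, the `example.com` URLs and the sitemap text are made up, and `max_pages` is just a crude cap on how many pages get visited in one run.

```python
from collections import deque
from xml.etree import ElementTree

# Hypothetical in-memory "web": each URL maps to the URLs it links to.
LINKS = {
    "https://example.com/": ["https://example.com/about", "https://example.com/blog"],
    "https://example.com/about": [],
    "https://example.com/blog": ["https://example.com/blog/post-1"],
    "https://example.com/blog/post-1": ["https://example.com/"],
    "https://example.com/orphan": [],  # no page links here
}

# Hypothetical sitemap listing two URLs, one of which nothing links to.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/orphan</loc></url>
</urlset>"""


def sitemap_urls(xml_text):
    """Return the <loc> values from a sitemap document."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ElementTree.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", ns)]


def crawl(seeds, links, max_pages=100):
    """Breadth-first discovery: visit known URLs, queue any new links found."""
    frontier = deque(seeds)   # URLs waiting to be visited
    seen = set(seeds)         # URLs already queued, to avoid loops
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return visited


discovered = crawl(sitemap_urls(SITEMAP_XML), LINKS)
```

In this sketch the `/orphan` page is only discovered because the sitemap lists it. Seed the crawl with the homepage alone and it never turns up, which is the practical argument for keeping a sitemap.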
One thing that confuses a lot of beginners: crawling and indexing are not the same step. Google’s own Search Central documentation treats them as two separate stages, and mixing them up is one of the most common mistakes new website owners make. A page can be crawled and never indexed. Getting visited by the bot doesn’t guarantee a spot in the database.
There’s also a limit to how much a crawler can do. Google calls this crawl budget, the number of pages Googlebot will crawl on a given site within a given timeframe. Crawl budget matters most for huge sites with millions of pages. A ten-page blog rarely needs to think about it; a sprawling e-commerce site with half a million product pages absolutely does.
A few files control crawler behavior directly:
- robots.txt: a text file at the root of a domain that tells crawlers which parts of the site they’re allowed to visit.
- XML sitemap: the page list mentioned above.
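- noindex tag: an instruction, set in a robots meta tag or an X-Robots-Tag HTTP header, telling a search engine “you can crawl this page, but don’t add it to your index.” The crawler has to be able to fetch the page to see the instruction, so a page blocked in robots.txt can’t pass along a noindex.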
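Here is a short sketch of how a well-behaved crawler might check those two controls, using only Python’s standard library. The robots.txt content, the `ExampleBot` user agent, and the URLs are invented for illustration; `urllib.robotparser` and `html.parser` are real standard-library modules.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everything is crawlable except /admin/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def can_crawl(robots_txt, user_agent, url):
    """Return True if the robots.txt rules allow this user agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class RobotsMetaParser(HTMLParser):
    """Flags a page whose <meta name="robots"> content includes noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = (attr.get("content") or "").lower()
        if name == "robots" and "noindex" in content:
            self.noindex = True


def has_noindex(html):
    """Return True if the HTML carries a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex


blog_allowed = can_crawl(ROBOTS_TXT, "ExampleBot", "https://example.com/blog")
admin_allowed = can_crawl(ROBOTS_TXT, "ExampleBot", "https://example.com/admin/users")
page_is_noindex = has_noindex('<head><meta name="robots" content="noindex, follow"></head>')
```

Note the order of operations: `has_noindex` can only run on HTML the crawler was allowed to download. That’s why robots.txt and noindex solve different problems, one controls crawling and the other controls indexing.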
Step 2: Indexing
If a crawled page passes muster, it gets added to the search index, a massive, constantly updated database of web pages. Think of the index as the search engine’s actual storehouse of information. The live web has billions of pages; the index holds the subset Google has decided is worth storing and serving to users.
Not every crawled page makes it in. Pages get left out of the index for a handful of common reasons:
- Thin content. Pages with very little substance, just a few sentences or a near-empty template, often don’t clear the bar.
- Duplicate content. If a page is a near-copy of another page already indexed, on the same site or a different one, Google typically picks one version and skips the rest.
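- Blocked by robots.txt or a noindex tag. A noindex tag explicitly tells search engines to keep the page out of the index. A robots.txt rule blocks crawling instead, so the page’s content never gets read, though the bare URL can occasionally still be indexed if other pages link to it.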
- Technical errors. Server errors, broken redirects, or pages that simply won’t load reliably can keep content out of the index entirely.
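As a rough illustration of the first two filters, the sketch below drops pages under a word-count threshold and pages whose normalized text matches one already kept. Everything here is an assumption made for the example: the `MIN_WORDS` value is invented, and real search engines use far more sophisticated quality and near-duplicate detection than an exact hash match.

```python
import hashlib
import re

MIN_WORDS = 50  # invented threshold, purely for illustration


def fingerprint(text):
    """Hash the text after lowercasing and stripping punctuation and spacing."""
    normalized = " ".join(re.findall(r"\w+", text.lower()))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def select_for_index(pages, min_words=MIN_WORDS):
    """pages is a list of (url, text) pairs in crawl order.

    Returns (indexed_urls, skipped), where skipped maps url -> reason.
    """
    indexed, skipped, seen = [], {}, {}
    for url, text in pages:
        if len(text.split()) < min_words:
            skipped[url] = "thin content"
            continue
        fp = fingerprint(text)
        if fp in seen:
            skipped[url] = f"duplicate of {seen[fp]}"
            continue
        seen[fp] = url
        indexed.append(url)
    return indexed, skipped


# Tiny demo with a lowered threshold so the sample texts stay short.
sample = [
    ("https://example.com/a", "Search engines crawl index and rank pages"),
    ("https://example.com/b", "search engines crawl, index, and rank pages."),
    ("https://example.com/c", "Coming soon"),
]
indexed, skipped = select_for_index(sample, min_words=3)
```

In the demo, page `/b` is skipped as a duplicate of `/a` because the two differ only in case and punctuation, and `/c` is skipped as thin. The first version seen wins here, which is a simplification; choosing which duplicate to keep is its own problem.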
A page can also get removed from the index after the fact. This is called being deindexed, and it can happen because of a manual action (a human reviewer at Google flagged a policy violation) or because an algorithmic update reassessed the page’s quality and decided it no longer belongs.
Here’s the distinction worth remembering: the live web and the index are not the same thing. Google isn’t searching the internet in real time when you hit enter. It’s searching its own pre-built index, which is why brand-new pages sometimes take hours or days to show up in results. They have to be crawled and indexed first.
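A toy inverted index shows why this lag exists. The sketch below is a deliberately simplified model, with made-up pages and URLs: it maps each word to the set of pages containing it, and queries are answered from that map alone, never from the pages themselves.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Hypothetical pages standing in for the live web.
PAGES = {
    "https://example.com/crawling": "How crawling discovers new pages",
    "https://example.com/indexing": "How indexing stores pages for search",
}
index = build_index(PAGES)

# A new page goes live, but the index has not been rebuilt yet.
PAGES["https://example.com/ranking"] = "How ranking orders pages"
before_reindex = search(index, "ranking")  # nothing found: the index is stale

index = build_index(PAGES)
after_reindex = search(index, "ranking")   # found once the page is indexed
```

The new page exists the moment it is added to `PAGES`, but `search` can’t see it until `build_index` runs again. That gap is the same one a freshly published page sits in while it waits to be crawled and indexed.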
Step 3: Ranking
Ranking is the step that decides the order. Once a query comes in, the search engine scans its index for pages that match, then sorts them using an algorithm, a set of rules and calculations designed to surface the most useful results first.
Google has long said it weighs more than 200 signals when ranking pages, though it has never published a complete list, and no one outside Google’s ranking team knows the full picture or the exact weight given to each factor. What’s confirmed and well-documented falls into a few broad buckets:
| Signal Category | What It Measures | Example |
| --- | --- | --- |
| Relevance | How closely the content matches the query’s intent | Does the page actually answer the question asked? |
| Quality and Expertise | Whether the content demonstrates real knowledge, often described through Google’s E-E-A-T framework (Experience, Expertise, Authoritativeness, Trustworthiness) | Author credentials, depth of explanation, original insight |
| Page Experience | Technical and usability factors, including Core Web Vitals | Page load speed, mobile usability, visual stability |
| Freshness | How recently the content was published or updated, weighted more heavily for time-sensitive queries | A page about “best laptops 2026” updated this month versus one from 2021 |
| Authority | Signals of trust built over time, including backlinks from other reputable sites | A medical page linked to by established health publishers |
These buckets aren’t a complete list, and they’re not weighted equally for every query. A page about a recipe and a page about a medical symptom get evaluated through very different lenses; Google calls the second category YMYL (Your Money or Your Life) content, and it’s held to a noticeably higher bar for accuracy and authorship. For deeper coverage of individual ranking signals, see our dedicated Ranking and Algorithms pages.
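One way to picture “not weighted equally for every query” is a weighted sum over signal scores. The sketch below is purely illustrative: the signal names mirror the buckets above, but the weights, the 0-to-1 scores, and the two sample pages are all invented, and Google’s real systems are not a simple linear formula.

```python
from dataclasses import dataclass

# Invented weights for illustration only; the real weights are not public.
DEFAULT_WEIGHTS = {
    "relevance": 0.40,
    "quality": 0.25,
    "page_experience": 0.10,
    "freshness": 0.10,
    "authority": 0.15,
}

# A second, equally invented profile that leans almost entirely on relevance.
RELEVANCE_HEAVY_WEIGHTS = {
    "relevance": 0.80,
    "quality": 0.05,
    "page_experience": 0.05,
    "freshness": 0.05,
    "authority": 0.05,
}


@dataclass
class Page:
    url: str
    signals: dict  # signal name -> score between 0.0 and 1.0


def score(page, weights):
    """Weighted sum of a page's signal scores; missing signals count as 0."""
    return sum(w * page.signals.get(name, 0.0) for name, w in weights.items())


def rank(pages, weights=DEFAULT_WEIGHTS):
    """Return pages sorted from highest score to lowest."""
    return sorted(pages, key=lambda p: score(p, weights), reverse=True)


# Two hypothetical pages: one very relevant, one strong everywhere else.
focused = Page("https://example.com/focused", {
    "relevance": 0.9, "quality": 0.6, "page_experience": 0.5,
    "freshness": 0.2, "authority": 0.4,
})
polished = Page("https://example.com/polished", {
    "relevance": 0.6, "quality": 0.9, "page_experience": 0.9,
    "freshness": 0.9, "authority": 0.9,
})

default_order = [p.url for p in rank([focused, polished])]
relevance_order = [p.url for p in rank([focused, polished], RELEVANCE_HEAVY_WEIGHTS)]
```

With the default weights the polished page comes out on top; with the relevance-heavy profile the order flips. Neither page changed, only the lens it was judged through.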
Google also runs broad recalibrations of its ranking systems several times a year, known as core updates. The most recent one, the March 2026 Core Update, rolled out between March 27 and April 8, 2026. According to coverage from Search Engine Land, nearly 80% of top-three results shifted and almost one in four top-10 pages fell out of the top 100. Updates like this don’t punish sites directly. They re-evaluate the entire field of indexed pages against the same signals and reorder accordingly, so sites that aren’t doing anything wrong can still see movement if competing pages now satisfy those signals better.
A Word on PageRank, Then and Now
PageRank was the original algorithm Google’s founders, Larry Page and Sergey Brin, built the company on back in 1998. The core idea was simple: a page is more important if other important pages link to it. That concept hasn’t disappeared, but PageRank today is one signal among hundreds, not the dominant factor it once was in Google’s early years. If you’ve heard someone claim PageRank “is” the ranking algorithm, that’s outdated. It’s a component, not the whole system.
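The “important pages link to important pages” idea can be sketched in a few lines. What follows is the simplified textbook version of PageRank, computed by repeated iteration, not Google’s production system. The four-page link graph is made up, the damping factor of 0.85 is the commonly cited default, and the function assumes every link target also appears as a key in the graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to. Every link
    target must also be a key in links (an assumption of this sketch).
    """
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}  # start with equal rank

    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its rank evenly among the pages it links to.
                share = damping * ranks[page] / len(outlinks)
                for target in outlinks:
                    new_ranks[target] += share
            else:
                # A page with no outlinks spreads its rank across all pages.
                share = damping * ranks[page] / n
                for target in pages:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Hypothetical link graph: C is linked to by A, B, and D; nothing links to D.
GRAPH = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
ranks = pagerank(GRAPH)
most_important = max(ranks, key=ranks.get)
least_important = min(ranks, key=ranks.get)
```

In this graph C ends up with the highest score because three pages link to it, and A comes second because its one inbound link is from C. D, which nothing links to, keeps only the baseline share. That is the whole intuition: a link counts for more when it comes from a page that itself ranks well.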
A Simple Real-World Analogy
Picture a massive public library. Crawling is the librarian walking the stacks, noting which books exist and where they sit. Indexing is the library’s card catalogue, the organized record of every book the librarian decided was worth cataloguing, complete with subject, author, and summary. Ranking is what happens when you ask the librarian for the best book on a topic: they don’t hand you every book in the catalogue, they hand you the ones most likely to actually answer your question, in order of usefulness.
A book the librarian never noticed (never crawled) can’t appear in the catalogue. A book in the catalogue that’s poorly organized or duplicated elsewhere might get skipped over when you ask for the best recommendation. That’s ranking at work.
What Happens After Ranking? (Serving Results / SERP)
Once pages are ranked, the search engine assembles the SERP, the search engine results page you actually see. In 2026, that page looks a lot more crowded than it did even a few years ago. Traditional blue links now share space with featured snippets, People Also Ask boxes, knowledge panels, and increasingly, AI Overviews, Google’s AI-generated summaries that appear above traditional results for many queries.
Featured snippets used to be the prize everyone chased, the single highlighted answer box at the top of the page. In 2026 they’re sharing that real estate with AI Overviews, and that’s changed what ranking well actually looks like. Showing up in position one no longer guarantees a click; more than half of searches now end without the user clicking through to any website at all, a trend often called zero-click search. Figures like this shift quickly, so treat any specific percentage as a snapshot rather than a fixed number. For a full breakdown of how SERP features and AI Overviews work, see our dedicated SERP features pillar page.
FAQ: Crawling and Indexing Questions
What is crawling and indexing in simple terms?
Crawling is how a search engine discovers a page exists. Indexing is the decision to store that page in the searchable database. A page can be crawled without ever being indexed.
How does a search engine crawler work?
A crawler like Googlebot starts from a known set of URLs, follows links from those pages to discover new ones, reads sitemaps that site owners submit, and repeats the process continuously. It respects rules set in a site’s robots.txt file along the way.
Why isn't my page showing up in Google?
The most common reasons are that it hasn’t been crawled yet, it’s blocked by a noindex tag or robots.txt rule, or it was crawled but judged too thin or too similar to existing content to index. Checking Google Search Console is the fastest way to find out which applies.
Is crawling the same as ranking?
No. Crawling is discovery, indexing is storage, and ranking is the ordering that happens at the moment someone searches. A page goes through all three stages separately, and clearing one doesn’t guarantee it clears the next.
Does Google index the entire internet?
No. Google’s index holds a large but selective portion of the web. Pages get excluded for technical errors, thin or duplicate content, explicit noindex instructions, or policy violations.