Technical Description – Edgar Azuara's Portfolio

Edgar Azuara

Class: Eng 21007

Introduction:
Search engines are digital systems that locate information across the internet by crawling, indexing, and ranking webpages according to their relevance to a user’s query. They serve as the gateway through which billions of people access the web every day, using algorithms that evaluate keywords, page structure, user behavior, and other signals to return the most useful results. Because of their efficiency and scale, search engines have become essential tools for learning, business, research, and navigating online information.

Brief History:
The earliest search engines appeared in the early 1990s, beginning with Archie (1990), which indexed FTP directories rather than webpages. As the World Wide Web expanded, tools such as Lycos (1994), AltaVista (1995), and Yahoo! Directory emerged, each improving how online content was cataloged and retrieved. The field changed dramatically in 1998 when Google introduced PageRank, an algorithm that ranked websites based on the number and quality of hyperlinks pointing to them. This shift from simple keyword matching to relevance ranking based on link authority transformed search effectiveness and set the foundation for modern search engines, which now integrate machine learning, natural language processing, and sophisticated ranking models.

Pros of Search Engines:
Search engines provide rapid access to vast amounts of information, making research and information retrieval far easier than manual browsing. They enhance productivity by ranking results by relevance, filtering out spam, and offering advanced tools such as image search, news filtering, and scholarly search portals. For businesses and creators, search engines are vital for visibility, allowing users to discover content through organic search results. Modern search engines also assist users with personalized recommendations and query suggestions, improving the overall search experience.

Cons of Search Engines:
Despite their benefits, search engines also bring challenges. Ranking algorithms can create bias by prioritizing popular or optimized sites rather than the most accurate ones. Personalization features may reinforce “filter bubbles,” exposing users to narrower viewpoints. Search engines also gather significant amounts of data, raising privacy concerns over how search history and behavior are tracked or used for targeted advertising. Additionally, website owners sometimes manipulate rankings through tactics like keyword stuffing or link schemes, which can distort search quality.
Noble (2018) argues that algorithmic ranking systems can reinforce social and informational inequalities, highlighting the need for transparency and accountability in search technologies.

Conclusion:
Search engines have become indispensable tools for navigating an otherwise overwhelming amount of online content. While they offer remarkable efficiency, accessibility, and organization, they also raise issues related to privacy, bias, and information quality. Understanding both the strengths and limitations of search engines is essential for using them effectively and critically. As technologies evolve, ongoing research and thoughtful design will continue to shape how search engines influence society and information access.

Keywords:

URL Server: A component that maintains the master list of web addresses (URLs) to be crawled. It schedules and sends lists of URLs to the crawlers, keeping track of metadata such as when a page was last fetched.
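
The scheduling role described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name UrlServer, its method names, and the "never fetched first, then stalest first" policy are hypothetical choices, not the design of any real search engine.

```python
from dataclasses import dataclass, field


@dataclass
class UrlServer:
    """Hypothetical URL server: the master list of URLs plus fetch metadata."""

    # Maps each known URL to the time it was last fetched (None = never).
    last_fetched: dict = field(default_factory=dict)

    def add(self, url):
        # Register a URL once; re-adding a known URL keeps its metadata.
        self.last_fetched.setdefault(url, None)

    def mark_fetched(self, url, timestamp):
        # Record when a crawler last downloaded this URL.
        self.last_fetched[url] = timestamp

    def next_batch(self, size):
        # Assumed policy: never-fetched URLs first, then the stalest ones.
        def staleness(url):
            ts = self.last_fetched[url]
            return (ts is not None, ts or 0.0)

        return sorted(self.last_fetched, key=staleness)[:size]


server = UrlServer()
for url in ["http://example.test/a", "http://example.test/b", "http://example.test/c"]:
    server.add(url)
server.mark_fetched("http://example.test/a", 100.0)
server.mark_fetched("http://example.test/b", 50.0)
batch = server.next_batch(2)  # the list of URLs a crawler would receive next
```

In this toy run the batch contains the never-fetched page first, followed by the page with the oldest fetch time.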

Crawler: An automated software program (also known as a “spider” or bot) that systematically browses the World Wide Web to fetch and download web pages. It starts from a set of seed URLs, follows hyperlinks on those pages, and sends the raw page content and newly discovered URLs back to other system components for processing and storage.
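
As a rough illustration of the seed-and-follow behavior described above, the sketch below crawls a tiny in-memory "web" instead of the real internet. The names crawl, LinkExtractor, and TOY_WEB are hypothetical, the example.test addresses are made up, and the breadth-first order is an assumption chosen for simplicity.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML text, or None on failure."""
    queue = deque(seeds)
    seen = set(seeds)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # unreachable page: nothing to store or follow
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# A made-up three-page web used in place of real network requests.
TOY_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B again</a>',
    "http://example.test/b": '<a href="/missing">dead link</a>',
}
pages = crawl(["http://example.test/"], TOY_WEB.get)
```

Passing the fetch function in as a parameter keeps the sketch runnable offline; a real crawler would replace it with an HTTP client and add politeness rules such as rate limits.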

Store Server: This server packages the raw, downloaded web pages from the crawler for storage. It compresses the pages and stores them in the repository, often assigning a unique document ID (docID) to each page.

Repository: A centralized storage system that holds the entire collection of raw, compressed HTML content of all the web pages that have been crawled. It acts as the definitive archive of the downloaded web.
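
The store server and repository can be pictured together with a short sketch. It assumes an in-memory list in place of real disk storage and uses Python's zlib module for compression; the Repository class and its store and load methods are hypothetical names introduced for illustration.

```python
import zlib


class Repository:
    """Hypothetical in-memory repository of compressed pages keyed by docID."""

    def __init__(self):
        self._records = []   # list index doubles as the docID
        self._doc_ids = {}   # url -> docID

    def store(self, url, html):
        # Compress the raw page before storing it.
        blob = zlib.compress(html.encode("utf-8"))
        if url in self._doc_ids:
            # A re-crawled page keeps its existing docID.
            doc_id = self._doc_ids[url]
            self._records[doc_id] = (url, blob)
            return doc_id
        doc_id = len(self._records)
        self._doc_ids[url] = doc_id
        self._records.append((url, blob))
        return doc_id

    def load(self, doc_id):
        # Decompress on the way out, as an indexer would.
        url, blob = self._records[doc_id]
        return url, zlib.decompress(blob).decode("utf-8")


repo = Repository()
first = repo.store("http://example.test/", "<html>home</html>")
second = repo.store("http://example.test/a", "<html>page a</html>")
again = repo.store("http://example.test/", "<html>home v2</html>")
```

Here the third call reuses the docID of the first, so the repository holds one current copy per URL.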

Indexer: A component that reads the content from the repository, then decompresses and parses each document to create a searchable index. It extracts word occurrences (called “hits”), records each one’s position, font size, and capitalization, and distributes this information into data structures called barrels.
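
A simplified version of this parsing step is sketched below. It works on plain text rather than HTML, so font size is left out, and the Hit type and index_document function are hypothetical names used only for this example.

```python
import re
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One occurrence of a word (font size omitted in this sketch)."""

    position: int      # index of the word within the document
    capitalized: bool  # whether the occurrence starts with a capital letter


def index_document(text):
    """Return a mapping from lowercase word to its list of hits."""
    hits = defaultdict(list)
    for position, match in enumerate(re.finditer(r"[A-Za-z0-9]+", text)):
        token = match.group()
        hits[token.lower()].append(Hit(position, token[0].isupper()))
    return dict(hits)


hits = index_document("Search engines rank pages. Engines crawl pages.")
```

Looking up hits["engines"] in this example gives two hits, at word positions 1 and 4, with only the second one capitalized.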

Anchors: Refers to the anchor text (the clickable text part of a hyperlink) and the associated source and destination URLs found during the crawling process. This information is stored in a dedicated anchor file and is used later to determine link relationships and help calculate page importance (like PageRank).

URL Resolver: This component reads the anchors file, converts relative URLs into absolute ones, and resolves them into their corresponding unique document IDs (docIDs) in the system. It also integrates the anchor text into the forward index, associating it with the document the link points to.
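
The resolution step can be sketched with the standard library's urljoin. The function name resolve_anchors, the tuple layout of the anchor records, and the decision to skip links whose pages have no docID yet are all assumptions made for this illustration.

```python
from urllib.parse import urljoin


def resolve_anchors(anchors, doc_ids):
    """Resolve anchor records into docID-based structures.

    anchors: list of (source_url, href, anchor_text) tuples
    doc_ids: mapping from absolute URL to docID
    """
    anchor_text = {}  # destination docID -> anchor texts pointing at it
    links = []        # (source docID, destination docID) pairs
    for source_url, href, text in anchors:
        target_url = urljoin(source_url, href)  # relative -> absolute
        if source_url not in doc_ids or target_url not in doc_ids:
            continue  # assumption: ignore pages without a docID
        src, dst = doc_ids[source_url], doc_ids[target_url]
        anchor_text.setdefault(dst, []).append(text)
        links.append((src, dst))
    return anchor_text, links


doc_ids = {
    "http://example.test/": 0,
    "http://example.test/docs/intro": 1,
}
anchors = [
    ("http://example.test/", "docs/intro", "Introduction"),
    ("http://example.test/docs/intro", "/", "Home"),
    ("http://example.test/", "http://other.test/", "Elsewhere"),
]
anchor_text, links = resolve_anchors(anchors, doc_ids)
```

The anchor text is filed under the page a link points to, not the page it appears on, and the list of docID pairs is the kind of link data a ranking step such as PageRank could use.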

Barrels: Data files (often sharded or partitioned) used to store the partially sorted forward index created by the indexer. Each barrel contains “posting lists” (details about word occurrences) for a subset of terms to allow for efficient sorting and fast lookups during the search process.
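
One way to picture barrels is as several dictionaries, each responsible for a subset of word IDs. The sketch below is a simplification under stated assumptions: words are assigned to barrels by word ID modulo the number of barrels, and the names build_barrels, lookup, and lexicon are hypothetical.

```python
from collections import defaultdict


def build_barrels(forward_index, lexicon, num_barrels):
    """Partition postings into barrels by word ID.

    forward_index: docID -> {word: [positions]}
    lexicon: word -> wordID
    """
    barrels = [defaultdict(list) for _ in range(num_barrels)]
    for doc_id in sorted(forward_index):  # keeps each posting list in docID order
        for word, positions in forward_index[doc_id].items():
            word_id = lexicon[word]
            barrels[word_id % num_barrels][word_id].append((doc_id, positions))
    return [dict(barrel) for barrel in barrels]


def lookup(barrels, lexicon, word):
    """Return the posting list for a word, touching only one barrel."""
    word_id = lexicon.get(word)
    if word_id is None:
        return []
    return barrels[word_id % len(barrels)].get(word_id, [])


lexicon = {"search": 0, "engine": 1, "crawl": 2}
forward_index = {
    0: {"search": [0], "engine": [1]},
    1: {"engine": [0], "crawl": [1]},
}
barrels = build_barrels(forward_index, lexicon, 2)
engine_postings = lookup(barrels, lexicon, "engine")
```

Because each word lives in exactly one barrel, a query only needs to open the barrels that hold its terms instead of scanning the whole index.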

Sources:

Noble, Safiya Umoja. Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press, 2018.
Brin, Sergey, and Lawrence Page. “The Anatomy of a Large-Scale Hypertextual Web Search Engine.” Computer Networks and ISDN Systems, 1998.
Sullivan, Danny. “A Brief History of Search Engines.” Search Engine Land, 2008.
Battelle, John. The Search: How Google and Its Rivals Rewrote the Rules of Business and Transformed Our Culture. Portfolio, 2005.
Stanventures. (n.d.). Top 10 search engines in the world (2025 updated list). Stanventures. https://www.stanventures.com/blog/top-search-engines-list/
Freedom Consulting Group. (2016, November 17). All the firsts: Search engines. https://freedomconsultinggroup.com/2016/11/17/all-the-firsts-search-engines/
https://tenor.com/view/s-gif-16210546847389979593
PromptCloud. (2016). Different components of a crawlable search engine. PromptCloud Blog. https://www.promptcloud.com/blog/components-crawlable-search-engine/