Today Prabook covers roughly 3 million biographies of people whose lives span the time from the BC era to the present day. That makes Prabook first and foremost a biographical database. The online platform built around it, with all its features and user integration services, serves primarily to present the information; the main value remains in the data itself.
Gathering this data is a key mission of Prabook. Our site allows everyone to create articles about themselves or other people, and a significant amount of Prabook content has indeed come from user contributions. However, most of the information on the site originates from various archival biographical sources, such as international and national Who’s Who editions, biographical dictionaries, and other credible publications.
To be included on Prabook, a biographical source should provide only factual information. This condition not only makes Prabook content more accurate and less prone to controversy, but also protects it from copyright claims. According to both American and European law, mere collections of facts are considered unoriginal and thus not protected by copyright. Our contractors are tasked with finding factual biographical sources so that this information can be brought to Prabook. Usually these are books that require scanning and parsing. In this case, the scanned pages are converted into an array of text, which is automatically refined and then processed with special text-parsing software. The parser determines the beginning and end of each biography based on paragraphs, which allows it to count the biographies and split them into separate articles.
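The paragraph-based splitting step can be illustrated with a minimal sketch. It assumes that biographies in the scanned text are separated by blank lines and that line wraps inside an entry carry no meaning; the function name `split_biographies` and the sample text are hypothetical and are not taken from Prabook's actual parser.

```python
import re


def split_biographies(text: str) -> list[str]:
    """Split raw scanned text into one string per biography.

    Assumption: entries are separated by one or more blank lines.
    """
    # Normalize line endings so the blank-line pattern works on any input.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    paragraphs = re.split(r"\n\s*\n", text)
    # Join wrapped lines inside each entry and drop empty fragments.
    entries = [" ".join(p.split()) for p in paragraphs]
    return [e for e in entries if e]


sample = (
    "SMITH, John, engineer; b. Boston,\n"
    "May 1, 1901.\n"
    "\n"
    "JONES, Mary, chemist; b. Leeds, 1910.\n"
)
entries = split_biographies(sample)
print(len(entries), "biographies found")
```

Real scans rarely have such clean boundaries, so a production parser would likely combine this with other cues, such as a capitalized surname at the start of an entry.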
Many other online biographical databases stop at this point and show the digitized articles as they are. However, the information is more useful when it is presented in a more approachable and user-friendly manner. The first obvious obstacle to that is the presence of abbreviations. Expanding them not only makes the text more comprehensible but also allows it to be analyzed further.
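As an illustration of abbreviation expansion, here is a minimal dictionary-based sketch. The table `ABBREVIATIONS` and the function `expand_abbreviations` are hypothetical; a real table would be far larger and would probably depend on the source edition.

```python
import re

# Hypothetical sample table; not Prabook's actual abbreviation list.
ABBREVIATIONS = {
    "b.": "born",
    "m.": "married",
    "educ.": "educated",
    "univ.": "university",
}


def expand_abbreviations(text: str, table: dict[str, str] = ABBREVIATIONS) -> str:
    """Replace known abbreviations with their full forms."""
    # Try longer keys first so a short key never shadows a longer one.
    keys = sorted(table, key=len, reverse=True)
    # (?<!\w) keeps "b." from matching inside a word such as "club."
    pattern = re.compile(r"(?<!\w)(" + "|".join(re.escape(k) for k in keys) + r")")
    return pattern.sub(lambda match: table[match.group(1)], text)


expanded = expand_abbreviations("b. Boston, 1901; m. Ann Lee, 1925.")
print(expanded)
```

The match is case-sensitive on purpose: an uppercase "B." is more likely to be a person's initial than the abbreviation for "born".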
The parser then breaks the whole text of each article into sentences and analyzes them one by one, also taking into account their position in the article (e.g. first paragraph, middle, bottom, etc.). The program uses so-called "regular terms", i.e. words that indicate the subject of the sentence containing them. For example, the word "Nationality" likely states something about the nationality of a person, especially if it is followed by a capitalized word (like "American" or "Nigerian"). Similarly, the word "born" at the beginning of a biography almost certainly indicates that the sentence containing it describes the circumstances of the person's birth. Using the dictionary of regular terms, the parser structures the text by moving each sentence into one of the Prabook fields.
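The idea of routing sentences into fields by "regular terms" can be sketched as follows. The term dictionary, the field names, and the rule that "born" counts only in the first two sentences are all illustrative assumptions, not Prabook's real configuration.

```python
import re

# Hypothetical dictionary of regular terms: field name -> regex patterns.
REGULAR_TERMS = {
    "birth": [r"\bborn\b"],
    # "Nationality" followed by a capitalized word, e.g. "Nationality: American".
    "nationality": [r"\bNationality\b[:\s]+[A-Z]"],
    "education": [r"\beducated\b", r"\bgraduated\b"],
    "career": [r"\bappointed\b", r"\bworked\b"],
}


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def classify(sentence: str, position: int) -> str:
    """Return the field for a sentence, given its index in the article."""
    for field, patterns in REGULAR_TERMS.items():
        # Assumed positional rule: "born" only signals birth near the start.
        if field == "birth" and position > 1:
            continue
        if any(re.search(p, sentence) for p in patterns):
            return field
    return "other"


def structure(text: str) -> dict[str, list[str]]:
    """Group the sentences of one biography by detected field."""
    fields: dict[str, list[str]] = {}
    for index, sentence in enumerate(split_sentences(text)):
        fields.setdefault(classify(sentence, index), []).append(sentence)
    return fields


sample = (
    "John Smith was born in Boston in 1901. Nationality: American. "
    "He graduated from Harvard in 1923. He was appointed director in 1940."
)
result = structure(sample)
for field, sentences in result.items():
    print(field, "->", sentences)
```

The naive sentence splitter breaks on every period, which is one more reason to expand abbreviations before this step.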
Another challenge, however, is maintaining a database that consists of unique profiles, i.e. one without duplicated articles. Without processing the biographical text as described above, some biographical databases return a list of nearly identical results for a given search query, their databases inflated several times over with redundant articles.
To solve this problem, we use a merging algorithm that identifies duplicate articles based on several key fields: first name and surname (including spelling variants), date of birth, and text similarity. As a result, a simple array of text yields the unique, individualized profiles that set Prabook apart from other online biographical encyclopedias.
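A minimal sketch of such duplicate detection is shown below. It is not Prabook's actual algorithm: the profile layout (`name`, `birth_date`, `text`), the thresholds, and the use of Python's `difflib.SequenceMatcher` for both name and text similarity are assumptions made for illustration.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize_name(name: str) -> str:
    """Lowercase, strip accents and collapse spaces to absorb spelling variants."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(without_accents.lower().split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] as computed by difflib."""
    return SequenceMatcher(None, a, b).ratio()


def is_duplicate(p: dict, q: dict,
                 name_threshold: float = 0.85,
                 text_threshold: float = 0.6) -> bool:
    """Two profiles match if birth date, name and text all agree closely enough."""
    if p["birth_date"] != q["birth_date"]:
        return False
    names_close = similarity(normalize_name(p["name"]),
                             normalize_name(q["name"])) >= name_threshold
    return names_close and similarity(p["text"], q["text"]) >= text_threshold


def deduplicate(profiles: list[dict]) -> list[dict]:
    """Keep the first profile of each duplicate group (quadratic, demo only)."""
    unique: list[dict] = []
    for profile in profiles:
        if not any(is_duplicate(profile, kept) for kept in unique):
            unique.append(profile)
    return unique


a = {"name": "Antonín Dvořák", "birth_date": "1841-09-08",
     "text": "Czech composer born in Nelahozeves."}
b = {"name": "Antonin Dvorak", "birth_date": "1841-09-08",
     "text": "Czech composer, born in Nelahozeves."}
c = {"name": "John Smith", "birth_date": "1901-05-01",
     "text": "American engineer born in Boston."}
unique_profiles = deduplicate([a, b, c])
print(len(unique_profiles), "unique profiles")
```

Comparing every pair of profiles does not scale to millions of records; a real system would first group candidates by a cheap key such as date of birth and compare only within each group. A real merge would also combine the fields of matching profiles rather than simply keep the first one.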
This method produces the majority of Prabook’s overall biographical content, as well as all of its subscription content. Some of the most obscure biographies from old, worn editions of biographical dictionaries gain new life in the form of Prabook profiles thanks to this automatic processing.