Site counter / parser
What is it?
The parser is a program that converts text data into a structured format. In this case, it is a program that collects text information from a web page and counts the total amount of this information.
Site parsers or counters can be useful for translators or site owners when a site has to be translated completely or partially but it is not possible to work with separate language files. In this case, estimating the total cost of the work requires evaluating the volume of text on the entire site or on the pages of interest, which can take too much time when done manually.
Why our parser?
The need for parsing arises not only in translation but also in SEO (namely, filling the site with content) and in trading on all kinds of exchanges (continuous monitoring of rate changes). Ready-made solutions have therefore long been available on the Internet, either as separate software modules or as programs with a user interface. However, most of them are inconvenient for translation work precisely because they were built for other tasks. Existing parsers search for specific information on a particular site and must be told in advance what kind of information to find. Our parser gathers textual data from any site without site-specific setup.
In addition to the parsers described above, there are also site counters created specifically for translators. They do what is required: they evaluate the text length of the entire site. Unfortunately, there are very few such services, and they are weak. They may include excess characters in the count, overstating the real text length several times over. This happens because they accidentally revisit pages that were already processed and fail to identify duplicates. Another major shortcoming of such counters is that they are only counters: they cannot provide the text of each page. For the translator to evaluate the amount of work correctly, the text of each page therefore has to be pulled out manually into a separate document. Our parser has none of these shortcomings.
Features and limitations
Our parser takes into account information that most other counters miss but that is significant for the localization and subsequent promotion of the site. In addition to the contents of the HTML tags themselves, the following are counted:
- Tag title
- Meta-tag description
- Values for the alt, title, value, and placeholder attributes
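The items listed above can be illustrated with a minimal sketch that uses Python's standard html.parser module. It is not the parser's actual implementation; the class name ExtraTextCollector, the function collect_extra_text and the sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Attributes whose values are counted, according to the list above.
COUNTED_ATTRS = ("alt", "title", "value", "placeholder")


class ExtraTextCollector(HTMLParser):
    """Collects the <title> text, the meta description and counted attribute values."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.found = []  # list of (source, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            if attrs.get("content"):
                self.found.append(("meta description", attrs["content"]))
        for name in COUNTED_ATTRS:
            if attrs.get(name):  # skip missing or empty attributes
                self.found.append((f"{tag} ({name})", attrs[name]))

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title and data.strip():
            self.found.append(("title", data.strip()))


def collect_extra_text(html):
    collector = ExtraTextCollector()
    collector.feed(html)
    collector.close()
    return collector.found


sample = (
    '<html><head><title>Shop</title>'
    '<meta name="description" content="Best shop"></head>'
    '<body><img src="a.png" alt="Logo"><input placeholder="Search"></body></html>'
)
for source, text in collect_extra_text(sample):
    print(source, "->", text)
```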
Compared to other counters, the algorithm for processing the site has been improved, which makes it possible to obtain adequate evaluations of the text length, namely:
- There is no difference between the http and https protocols (http://site.com = https://site.com)
- The subdomain www (site.com = www.site.com) is ignored
- Text elements that do not need translation, such as numbers and dates, are ignored
- Duplicate content is checked for; if duplication is suspected, you will get a warning
It should be understood that when localizing a site it is always better to work with specially prepared language files; the parser is a backup option. The following limitations must be taken into account:
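The first two rules in the list above amount to comparing pages by a normalized address. The sketch below shows one possible way to do that with urllib.parse; the function name page_key is hypothetical, and ignoring the port is an assumption of this sketch, not something the parser is known to do.

```python
from urllib.parse import urlsplit


def page_key(url):
    """Return a key that is equal for addresses treated as the same page."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()  # the scheme (http/https) is dropped entirely
    if host.startswith("www."):
        host = host[4:]  # site.com = www.site.com
    path = parts.path or "/"
    return host + path + ("?" + parts.query if parts.query else "")


print(page_key("http://site.com") == page_key("https://www.site.com/"))
```

Two addresses with the same key would be visited only once.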
- When processing the site, only the first 500 pages of the site will be counted
- When processing separate web pages, you can process up to 10 links at a time
- Detailed parsing is possible only when processing separate links
- The volume of each page processed should not exceed 1 MB
These limitations are caused by the capabilities of the server. We are working to relax them. In addition:
- Information that is accessible only to authorized users will not be processed (this can be solved by using cookies)
- Information that is displayed only after specific user actions, for example a registration confirmation page, filter output results, and so on, will not be processed
- Information that is not directly part of the site text will not be processed, for example e-mail texts that are automatically sent to users, documents posted on the site, and the like
Some of these limitations cannot be circumvented programmatically.
How to start parsing?
Our parser can handle both separate links and entire sites.
If you would like to count the number of characters on the entire site, select the "Site" tab. In the input field, enter the site address, for example, http://example.com. Click the "Count" button.
If you are only interested in several pages of a site or in pages from different sites, select the "List of links" tab. Here you can specify up to 10 links, each on a new line. Note that when processing separate links you have two options: counting and parsing. By clicking the "Count" button you will receive a short report with the number of words and characters on each page from the list. By clicking the "Parse" button you will receive, in addition to the same general report, a table with all text content structured according to the HTML markup for each page in the list. For more information, see Processing Results.
The "Extract links" function is available when processing individual pages. Links for translation are often supplied as a numbered list or together with additional comments. Instead of deleting all the superfluous text manually, just paste it into the input field and click the "Extract links" button. All unnecessary text will be removed automatically, leaving only the links in the field, each on a new line.
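Link extraction of this kind can be approximated with a regular expression. The following is only a sketch under assumptions: the pattern, the function name extract_links and the trimming of trailing punctuation are illustrative choices, not the tool's actual rules. A URL that legitimately ends in a closing parenthesis would be cut short by this version.

```python
import re

# Anything that starts with http:// or https:// and runs up to whitespace or a quote.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_links(text):
    """Return only the links found in text, one per line."""
    links = []
    for match in URL_PATTERN.findall(text):
        # Drop punctuation that usually belongs to the surrounding sentence.
        links.append(match.rstrip(".,;:!?)"))
    return "\n".join(links)


pasted = "1. http://example.com/page (home page)\n2) see https://example.com/about, please"
print(extract_links(pasted))
```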
If you check the box for receiving the results by email before running the parser, the results will not be displayed in the browser. Instead, the report will be sent to the email address you specify. Please note that this may take longer than parsing in the browser because of delays on the mail servers. If the email with the results does not arrive for a long time, check the Spam folder. If it is not there, write to support: email@example.com.
The tag trimming option is available only when parsing. It is enabled by default for a more compact representation of the results. Trimming means that all attributes except id, as well as all closing tags, are removed. If this does not suit you, uncheck this box. In that case, all HTML markup in the parsing results will be presented unaltered.
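The effect of trimming can be sketched with two regular expressions. This is an illustration only: the function name trim_tags is hypothetical, and a regex-based approach breaks on unusual markup (for example, a ">" inside an attribute value), so it should not be read as the way the parser itself works.

```python
import re

# Opening or closing tag: optional slash, tag name, then everything up to ">".
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)([^<>]*)>")
# A quoted id attribute that is not part of a longer name such as data-id.
ID_ATTR = re.compile(r"""(?<![\w-])id\s*=\s*(?:"[^"]*"|'[^']*')""")


def trim_tags(markup):
    """Remove closing tags and every attribute except id."""

    def replace(match):
        closing, name, rest = match.groups()
        if closing:
            return ""  # closing tags are dropped
        id_match = ID_ATTR.search(rest)
        return f"<{name} {id_match.group(0)}>" if id_match else f"<{name}>"

    return TAG.sub(replace, markup)


print(trim_tags('<div id="result-example" class="someClass">Some text in <i>paragraph</i></div>'))
```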
The "include subdomains" option is available when counting the site. It is disabled by default, in which case the parser ignores all internal links that lead to deeper subdomains. For example, when counting the site site.com, the link news.site.com will be processed only if the "include subdomains" option is set. Note that this option works in one direction only, toward deeper subdomains: when counting the site news.site.com, the link site.com will not be visited in any case.
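The one-directional subdomain rule can be expressed as a short check. The sketch below is an assumption about how such a rule might look; the names host_of and should_visit are hypothetical.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Lower-cased host name without a leading www."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def should_visit(start_url, link_url, include_subdomains=False):
    """Decide whether a link found while counting start_url is followed."""
    start, link = host_of(start_url), host_of(link_url)
    if link == start:
        return True
    # Only deeper subdomains qualify, never the parent domain.
    return include_subdomains and link.endswith("." + start)


print(should_visit("http://site.com", "http://news.site.com/a"))
print(should_visit("http://site.com", "http://news.site.com/a", include_subdomains=True))
print(should_visit("http://news.site.com", "http://site.com", include_subdomains=True))
```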
Ignore request parameters in URL
This option is available when counting the site. If it is enabled, links that lead to the same web page but with different parameters will be treated as duplicates.
If the site of interest uses GET parameters to select the content it serves, you should disable this option. For example:
These links lead to pages with different content, as specified by the id parameter, although the web page address remains the same. To count sites that deliver content this way correctly, it is better to disable this option.
If the parameters determine not the actual content of the page but only the way it is served, it is better to enable this option so that the same pages of the site are not visited again. For example:
These two links will produce the same content, but in the first case, the video will be launched automatically, which does not affect the number of characters on the page.
If the request parameters on the site of interest are used for both of these purposes, disable this option. In this case, you will not miss content that is meaningful to you, and in case of duplicates the parser will issue a warning.
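The difference between the two settings can be sketched as follows. The function name dedup_key and the example.com addresses are hypothetical and stand in for the two kinds of links discussed above.

```python
from urllib.parse import urlsplit


def dedup_key(url, ignore_params=True):
    """Key used to decide whether two links count as the same page."""
    parts = urlsplit(url)
    key = (parts.hostname or "").lower() + (parts.path or "/")
    if not ignore_params and parts.query:
        key += "?" + parts.query  # parameters distinguish pages only when not ignored
    return key


# Hypothetical links whose id parameter selects different content.
a = "http://example.com/article?id=1"
b = "http://example.com/article?id=2"

print(dedup_key(a) == dedup_key(b))                # option enabled: treated as duplicates
print(dedup_key(a, False) == dedup_key(b, False))  # option disabled: both are visited
```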
The encoding option sets the source encoding of the site or pages that you are trying to parse. If the source encoding is not determined correctly, you are still likely to get correct count results, but when parsing, all the text of a web page may look something like this:
Ð¡ÑÑÐ°Ð½Ð¸ÑÐ°, ÐºÐ¾ÑÐ¾ÑÑÑ ÐÑ Ð¸ÑÐµÑÐµ, Ð½Ðµ Ð½Ð°Ð¹Ð´ÐµÐ½Ð°. ÐÐ¾Ð·Ð¼Ð¾Ð¶Ð½Ð¾, ÐÑ Ð½ÐµÐ¿ÑÐ°Ð²Ð¸Ð"ÑÐ½Ð¾ Ð²Ð²ÐµÐ"Ð¸ URL Ð¸Ð"Ð¸ Ð²Ð¾ÑÐ¿Ð¾Ð"ÑÐ·Ð¾Ð²Ð°Ð"Ð¸ÑÑ Ð½ÐµÑÐ°Ð±Ð¾ÑÐµÐ¹ ÑÑÑÐ"ÐºÐ¾Ð¹.
The source encoding can be determined from HTTP headers, the meta charset tag, or other software tools. Unfortunately, none of these methods guarantees a correct result in 100% of cases. In most cases automatic detection works correctly, but if it does not, you can select the desired encoding from the list. If you are the owner of the site and know its encoding, just select it from the list. Otherwise, you will have to experiment a little. In this case, we recommend starting with the most commonly used encodings:
- ISO-8859-1 (and other ISO)
- Windows-1251 (and other Windows)
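Trying encodings by hand, as suggested above, amounts to decoding the same bytes in several ways and looking at which result is readable. The sketch below assumes a hypothetical helper try_encodings; the candidate list, including UTF-8, is an assumption of the example and not the tool's own list.

```python
def try_encodings(raw, candidates=("utf-8", "windows-1251", "iso-8859-1")):
    """Decode raw bytes with each candidate; None marks a failed decode."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Bytes of a Russian word as a Windows-1251 page would send them.
raw = "Страница".encode("windows-1251")
for name, text in try_encodings(raw).items():
    print(name, "->", repr(text))
```

Only one of the candidates yields readable text here. ISO-8859-1 never fails to decode, so a successful decode alone does not prove that the encoding is right.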
Many sites use cookies for long-term storage of information about the user. In particular, cookies are used for quick authorization. If you selected the "Remember me" checkbox when logging in to a site, you will already be authorized on your next visit to that site, and this is implemented using cookies.
By using cookies when parsing, you can process site content that is available only to authorized users. To do this, select the "Use cookie" check box and enter all required name-value pairs.
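For readers curious how name-value pairs end up in a request, here is a minimal sketch that builds a Cookie header with the standard urllib.request module. The function name build_request, the address and the cookie names are hypothetical, and no request is actually sent.

```python
import urllib.request


def build_request(url, cookies):
    """Attach name-value pairs to a request as a single Cookie header."""
    header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    request = urllib.request.Request(url)
    if header:
        request.add_header("Cookie", header)
    return request


request = build_request(
    "http://example.com/account",
    {"sessionid": "abc123", "remember": "1"},  # hypothetical cookie names
)
print(request.get_header("Cookie"))
```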
Where to get cookie?
We'll describe where and how to find the name-value pairs for cookies using the Google Chrome browser. Instructions for other browsers can be found on the Internet.
Go to the site you want to parse and log in. Open the developer tools (F12 key). Select the "Application" tab, then "Cookies" in the side panel, and click on the entry corresponding to the site you are interested in. You will see a table of cookies. The first column is the name, the second is the value.
Note that the third column indicates the domain to which each cookie belongs. Quite often many of them belong to third-party sites, for example various services of search engines, social networks, and so on. Most likely they are not related to authorization on the site and are needed to connect additional functions or collect statistics.
In addition, for some cookies you can guess from the name that they are not related to content or authorization on the site. For example, names like last_visited or autoplay obviously should not affect the parsing process. But if you are not sure which cookies are needed and which are not, just copy them all.
Important. We do not save the cookies used for parsing and do not pass any data about them to anyone. However, we cannot guarantee their 100% security. Do not use cookies that give access to your personal information or that are used to manage your finances or business. We recommend creating a separate secure account for parsing.
As a result, you will receive a general report on all links visited and, if you performed parsing and not just counting, a detailed report on each page. The general report gives the total volume of text as the number of words, characters with spaces, and characters without spaces. The header of the report also shows the total number of processed pages and the number of successfully processed pages, warnings, and errors. By clicking on the relevant links you can switch between categories. This can make your work easier, for example when checking for warnings.
Further down, the general report presents the basic information about each separate page: the number of words, characters without spaces, and characters with spaces, as well as warnings or errors, if any. Thus, if some pages are not significant for you, you can simply subtract their figures from the total for the entire site.
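The three figures in the report can be reproduced with a few lines of Python. This is a sketch, not the parser's counting code: the function name text_stats is hypothetical, and treating every whitespace character as a space is an assumption.

```python
def text_stats(text):
    """Count words, characters without spaces and characters with spaces."""
    words = text.split()
    return {
        "words": len(words),
        "chars_no_spaces": sum(len(word) for word in words),
        "chars_with_spaces": len(text),
    }


print(text_stats("Some text in "))
```

For the string "Some text in " (with its trailing space) this gives 3 words, 10 characters without spaces and 13 with spaces, the same figures as the <p> row of the results table in this document.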
When parsing, you will also receive a separate table for each successfully processed web page, containing all the text found on the page, taking into account its location in the html-markup. There are 6 columns in this table:
- Level - specifies the depth of this element in the HTML markup hierarchy. The body tag is assumed to have level 0. Each nested element has a level one higher than its parent. This helps to relate the various elements to one another accurately
- Container - the HTML tag whose contents are represented in this table row. In addition to the contents of the tags themselves, the contents of some attributes are also processed. In that case the name of the attribute is indicated in parentheses immediately after the tag name
- Content - the inner text and nested tags of this tag
- Words, Ch, ChWS - the number of words, characters without spaces, and characters with spaces in the contents of this row. If necessary, you can subtract these figures from the overall total if some items are insignificant
Lines in which there is no text are tinted in gray. They are not deleted because they carry information about the nesting of tags and are important for understanding the structure of the web page.
Let's say that the web page you are interested in has the following code:
<div id="result-example" class="someClass">
Some text in <i>paragraph</i>
<a href="#result-example" title="Example link">Example</a>
The HTML markup given by this code will look something like this:
And here is the table with the parsing results:
|Level||Container||Content||Words||Ch||ChWS|
|1||<div id="result-example">||<h3> <p> <a>||0||0||0|
|2||<p>||Some text in <i>||3||10||13|
Please note that in the presented table with the results:
- The number specified in the level column corresponds to the depth of the element in the HTML markup. For child elements, it is always one more than for the parent
- In the container column, as well as in the content column, tags are shown without attributes (except id) and without closing tags. This is because the tag trimming option is enabled
- All the child elements of a tag are displayed first, and only then the siblings that follow it. Therefore, in the above example, the <i> tag with level 3 (parent - tag <p>) goes first, and only then the <a> with level 2
- For each tag, the contents of its attributes are displayed first, and then the contents of the tag itself. Therefore, in the example, the row <a> (title) comes first, and only then <a>
- Characters of child elements are not counted toward the parent's totals.
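The ordering rules above (children before following siblings, attribute rows before the tag's own row, level one higher than the parent) can be sketched with html.parser. The class RowLister, the function list_rows and the sample fragment are assumptions for illustration; the sketch expects well-formed markup with every non-void tag closed, and it only lists levels and containers without counting text.

```python
from html.parser import HTMLParser

COUNTED_ATTRS = ("alt", "title", "value", "placeholder")
# Tags that never have a closing tag and therefore cannot contain children.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class RowLister(HTMLParser):
    """Lists (level, container) rows for a fragment placed directly inside <body>."""

    def __init__(self):
        super().__init__()
        self.level = 1  # <body> is level 0, so its children start at 1
        self.rows = []

    def handle_starttag(self, tag, attrs):
        # Attribute rows come before the row for the tag itself.
        for name, value in attrs:
            if name in COUNTED_ATTRS and value:
                self.rows.append((self.level, f"<{tag}> ({name})"))
        self.rows.append((self.level, f"<{tag}>"))
        if tag not in VOID_TAGS:
            self.level += 1  # anything up to the closing tag is one level deeper

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.level -= 1


def list_rows(body_fragment):
    lister = RowLister()
    lister.feed(body_fragment)
    lister.close()
    return lister.rows


fragment = ('<div id="result-example"><p>Some text in <i>paragraph</i></p>'
            '<a href="#result-example" title="Example link">Example</a></div>')
for level, container in list_rows(fragment):
    print(level, container)
```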
List of possible errors
Below is a list of errors that may occur when using the parser, with instructions for resolving them. If you cannot resolve a parsing error on your own, please contact us at firstname.lastname@example.org.
#101: Possible duplicate
This warning appears when the number of words, characters with spaces, and characters without spaces is the same for the two specified pages. Most likely these pages have the same content, but it is best to check this yourself each time. If the page is a duplicate, you can delete it from the count results.
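The check described for this warning can be sketched as grouping pages by their three figures. The function name find_possible_duplicates, the addresses and the numbers below are hypothetical.

```python
from collections import defaultdict


def find_possible_duplicates(page_stats):
    """Group pages whose (words, chars without spaces, chars with spaces) all match."""
    groups = defaultdict(list)
    for url, stats in page_stats.items():
        groups[tuple(stats)].append(url)
    return [urls for urls in groups.values() if len(urls) > 1]


# Hypothetical count results: (words, characters without spaces, characters with spaces).
page_stats = {
    "http://example.com/a": (120, 600, 719),
    "http://example.com/b": (80, 410, 489),
    "http://example.com/a?ref=1": (120, 600, 719),
}
print(find_possible_duplicates(page_stats))
```

Equal figures only suggest a duplicate, which is why the result is a warning and not an automatic removal.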
#301: Invalid site address
The site address you want to count must match the format http://example.com, namely:
- Start with http:// or https://
- Contain a domain or subdomain name with a domain zone extension, for example, http://example.com or http://www.news.example.com
- It may contain a trailing slash, but nothing after the slash, for example, http://example.com or http://example.com/, but not http://example.com/page
#302: Some links are incorrect.
All entered links should correspond to the format http://example.com/page, namely:
- Start with http:// or https://
- Contain a domain or subdomain name with a domain zone extension, for example, http://example.com or http://www.news.example.com
- It may contain a page address, a trailing slash, and a list of parameters, for example, http://example.com or http://example.com/page/, or http://example.com/page?param=1
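The two formats, for a site address (#301) and for a link (#302), can be approximated with regular expressions. The patterns and the function names is_valid_site and is_valid_link are assumptions of this sketch; the tool's real validation may be stricter or looser, for instance with internationalized domain names.

```python
import re

# One or more dot-separated labels followed by a domain zone such as "com".
HOST = r"(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}"

# Site address: scheme, host, optional trailing slash, nothing else.
SITE_RE = re.compile(rf"https?://{HOST}/?", re.IGNORECASE)
# Link: scheme, host, optional path, optional list of parameters.
LINK_RE = re.compile(rf"https?://{HOST}(?:/[^\s?#]*)?(?:\?[^\s#]*)?", re.IGNORECASE)


def is_valid_site(address):
    return SITE_RE.fullmatch(address) is not None


def is_valid_link(link):
    return LINK_RE.fullmatch(link) is not None


print(is_valid_site("http://example.com/"))
print(is_valid_site("http://example.com/page"))
print(is_valid_link("http://example.com/page?param=1"))
```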
#303: Invalid cookie name format
Make sure you enter the correct cookie name. Cookie names cannot contain characters such as an equals sign (=), a comma (,), a semicolon (;), or a space.
#304: Invalid cookie value format
Make sure you enter the correct cookie value. Cookie values cannot contain characters such as a comma (,), a semicolon (;), or a space.
#305: Invalid email specified
The format of the e-mail address for sending parsing results is invalid. Be sure to enter a valid email address.
#306: Duplicate Cookie Names
Cookie names must not be repeated. Be sure not to enter the same name twice.
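The cookie checks behind errors #303, #304 and #306 can be sketched as below. The function name check_cookies is hypothetical, and the exact sets of forbidden characters are an assumption: the error descriptions list them only as examples.

```python
# Assumed sets of forbidden characters.
NAME_FORBIDDEN = set("=,; ")
VALUE_FORBIDDEN = set(",; ")


def check_cookies(pairs):
    """Return the sorted error numbers raised by a list of (name, value) pairs."""
    errors = set()
    seen = set()
    for name, value in pairs:
        if not name or NAME_FORBIDDEN & set(name):
            errors.add(303)  # invalid cookie name format
        if VALUE_FORBIDDEN & set(value):
            errors.add(304)  # invalid cookie value format
        if name in seen:
            errors.add(306)  # duplicate cookie names
        seen.add(name)
    return sorted(errors)


print(check_cookies([("sessionid", "abc123")]))
print(check_cookies([("bad name", "x"), ("a", "1"), ("a", "2")]))
```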
#401: Failed to load page
The page failed to load for unknown reasons. Restart the browser and try again. If the error persists, the problem is most likely on the server side. In this case, try again later.
#402: The file is not an html format
The file located at the specified link is not an HTML file, so it cannot be parsed.
#403: The file size exceeds the maximum allowed
The size of the web page exceeds the maximum allowed. We are working to remove this limitation.
#404: Page not found
The specified page could not be found. Be sure to enter the correct address.
#405: An error has occurred:...
When the page was loaded, the server gave an error. The text and the error number will be presented unchanged.
#406: Failed to parse the page
The page cannot be processed. Perhaps the page is not a valid HTML file.
#501: Invalid session number, or session is corrupt
Part of the data was lost because the session was corrupted. Restart the browser and try again. If the error persists, the problem is most likely on the server side. In this case, try again later.
#502: Expected URL not found
The next URL in the processing queue was not found. Restart the browser and try again. If the error persists, the problem is most likely on the server side. In this case, try again later.
#503: Failed to delete, an error occurred.
The server did not respond when trying to delete parsing results. If there are not many items to remove, save the report and subtract the items manually. If you need automatic deletion, restart the browser and try again. If the error persists, the problem is most likely on the server side. In this case, try again later.
#504: A request error occurred.
The connection to the server was lost. Restart the browser and try again. If the error persists, the problem is most likely on the server side. In this case, try again later.