Yesterday I posted about Semalt.com’s crawler and their unusual choice not to have their crawlers identify themselves as web crawlers or obey robots.txt, causing heartaches for analytics loving webmasters across the web. Semalt’s manager Alex Andrianov reached out through twitter and offered to answer some of my questions via email. The exchange is included in whole below.
|Hi, thanks for taking the time to chat with me in a bit more detail about Semalt. Happy to update my blog post with factual corrections you’re able to provide.You’ve mentioned on twitter that Semalt does not obey robots.txt, further saying that “can’t change it”. Could you explain in a bit more detail what keeps Semalt’s bots from identifying themselves as bots or obeying robots.txt? Is this a talent issue, where your developers haven’t been able to discover the processes to undertake this, or is this part of a business decision on Semalt’s part?Are there plans in the future to have Semalt’s bots identify themselves properly as crawlers and to obey robots.txt?
You also claimed that my comments at http://www.closetoclever.com/semalt-com/ were incorrect, as I was not a Semalt client. Were there any specific factual errors that you would like to address?
Thanks again for taking the time to answer these,
|Hello Jessica RoseThanks for your email.First of all I would like to bring apology on behalf of my company if our bots caused you some difficulties. I can assure you, all the visits on your website were accidental. At this moment our specialists are taking drastic actions to prevent these visits. Thank you for pointing to our service drawbacks. We appreciate your help and it is very important to us.
Our service has been launched quite recently and unfortunately there are still some bugs and shortcomings. Please, respect this fact. We are working hard trying to fix the existing errors and I hope soon our users won’t have any claims.
As you might notice, every user can manually remove their URLs from the Semalt database using Semalt Crawler. Furthermore, our Support center specialists are ready to come to the aid and remove URLs from the base once the website owner submits a request. We consider every single request and guarantee that every user will get a proper respond.
We realize this may bring some inconveniences, but unfortunately at the moment we can’t offer another way of solving this issue.
As for the comment posted on your blog, I believe it’s impossible to evaluate all the pros and cons unless you have the complete picture of the service. Probably once you try to use Semalt features you will change your mind.
Anyway, we thank you for your feedback, since we appreciate every opinion relating Semalt.
Semalt LLC manager , Alex Andrianov
|Thanks for the response, but would it be possible to have you address my specific questions more directly?1. Are you claiming that your bots’ failure to identify themselves as web crawlers is due to a technical failure?
2. Are you claiming that your bots not obeying robots.txt is due to a technical failure?
3. Do you have plans to make your bots identify themselves as web crawlers?
4. Do you have plans to have your bots comply with robots.txt?Jessie
|Dear Jessica,I will try to give the most definite answers to your questions. As I mentioned before our service has recently appeared on the web which causes some technical unavailability. Today we upgrade the web scanning process and adjust our robots. Unfortunately sometimes Semalt bots visit random websites, but we do all our best to solve this problem in the shortest possible time.Thank you for your email and interest to Semalt.com service. Your opinion is very important to us.
Semalt LLC manager, Alex Andrianov
|I’m not sure that’s answering much. I’m really looking to find out:1. Will your crawlers be respecting robots.txt after your upgrade?
2. Will your crawlers be identifying themselves as web crawlers after your upgrade?Jessie
What we learned from this exchange:
Nothing, really. There were some vague claims that the problems I’ve listed were “bugs” but no specific addressing of the problems of Semalt bots ignoring robots.txt or failing to properly identify themselves as web crawlers. Apparently several weeks of visits to sites across the web were “accidental”.
Why this is nonsense:
Given how easy creating robots.txt compliant crawlers are, failure of bots to identify themselves as web crawlers or obey robots.txt can only be viewed as a deliberate choice of the designer or gross incompetence. While my technical skills are also substandard, I’m confident that I would be able to put together a simple webcrawler that obeys robots.txt over the weekend (check in on Tuesday, I’ll be posting the results of my efforts). For a professional enterprise who sources data through crawling the web to claim following industry conventions is beyond their technical ability leaves me wondering if they’re fools or liars.