The Complete Mystery of Madeleine McCann™
Welcome to 'The Complete Mystery of Madeleine McCann' forum 🌹


Steve Marsden's WBM screenshot: The CEOP Home page for April 30, 2007 also refers to Missing Madeleine.



Post by Syn 07.07.15 21:40

SixMillionQuid wrote:
Tony Bennett wrote:
Nuala wrote:@ Tony Bennett

As a non-techie, what arguments from Whodunit persuaded you that these captures were also correct:

[You must be registered and logged in to see this link.]

Because if the mccann.html capture was correct, then those are correct as well, along with the thousands of other CEOP website examples also given the same 30 Apr 2007 date and time.

The above examples are only a tiny sample of the masses of news articles given a date of 30 Apr 2007, when the said articles hadn't even been published on that date. Note that the date of the articles is the date CEOP gave them when they published them, so 20070810 isn't a date from Wayback, it's a date from CEOP.

CEOP dated them 20070810 and when Wayback archived them it gave them a date of 30 Apr 2007.

I think you would agree that it's impossible for an article dated 20070810 and therefore not even in existence on 30 Apr 2007 to have been crawled by Wayback and correctly dated on 30 Apr 2007.

I think even a non-techie can see that.

So can you tell me what persuaded you that those news articles are in fact correctly dated as being in existence on 30 Apr 2007?
I follow your argument - and, speaking as a non-tecchie, if neither whodunit nor anyone else can supply a good answer to your point, I would declare:

'Advantage Nuala'
Sorry but why do those four links quoted take you directly to the ceop page? Where are the Wayback archived versions?
Original Press Release dated and uploaded by CEOP on 20070810:

[You must be registered and logged in to see this link.]

[You must be registered and logged in to see this image.]

Here are the 4 Wayback Source Directory links claiming a 20070430 archive date that Nuala was talking about:

[You must be registered and logged in to see this link.]  
[You must be registered and logged in to see this link.]

[You must be registered and logged in to see this image.]

Here they are in the Wayback Calendar  - 4 entries archived between 27th August and February 9th 2008

[You must be registered and logged in to see this link.]

[You must be registered and logged in to see this image.]

All present and correct and NONE of them archived
on 30th April 2007 despite what the WB source directory claims.
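
The date argument quoted above can be checked mechanically. What follows is only a minimal Python sketch, assuming the claimed capture time is written as a 14-digit YYYYMMDDhhmmss string and the publisher's date as the 8-digit YYYYMMDD form used in this post (20070810). The sample capture timestamp and the function name are made up for illustration and are not taken from Wayback itself.

```python
from datetime import datetime


def capture_predates_publication(capture_ts: str, published: str) -> bool:
    """Return True if the claimed capture time is earlier than the publication date."""
    captured = datetime.strptime(capture_ts, "%Y%m%d%H%M%S")
    published_on = datetime.strptime(published, "%Y%m%d")
    return captured < published_on


# Hypothetical capture timestamp on 30 Apr 2007; 20070810 is the CEOP date from the post.
if capture_predates_publication("20070430115800", "20070810"):
    print("Claimed capture is earlier than publication, so the capture date cannot be right.")
```

A capture stamped before the article existed fails this check, which is the point Nuala was making.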

Post by suzysu 07.07.15 21:58

@Syn ".....All present and correct and NONE of them archived
on 30th April 2007 despite what the WB source directory claims."

I'm sure it's maddening to you, as a techie, to have to deal with non-tech questions, but don't you find it ODD that an organisation as otherwise-credible as WBM hasn't come out publicly with an explanation?

Post by Syn 07.07.15 22:09

suzysu wrote:@Syn ".....All present and correct and NONE of them archived
on 30th April 2007 despite what the WB source directory claims."

I'm sure it's maddening to you, as a techie, to have to deal with non-tech questions, but don't you find it ODD that an organisation as otherwise-credible as WBM hasn't come out publicly with an explanation?
No not maddening at all suzysu :) And no I don't find it odd at all. They are a non profit organisation who do not have to explain anything to anyone.  They have had the decency to let us know that after further investigation the urls in question were archived incorrectly due to a subset issue which they are trying to resolve. They did not actually have to tell us anything.

Post by suzysu 07.07.15 22:18

Syn wrote:
suzysu wrote:@Syn ".....All present and correct and NONE of them archived
on 30th April 2007 despite what the WB source directory claims."

I'm sure it's maddening to you, as a techie, to have to deal with non-tech questions, but don't you find it ODD that an organisation as otherwise-credible as WBM hasn't come out publicly with an explanation?
No not maddening at all suzysu :) And no I don't find it odd at all. They are a non profit organisation who do not have to explain anything to anyone.  They have had the decency to let us know that after further investigation the urls in question were archived incorrectly due to a subset issue which they are trying to resolve. They did not actually have to tell us anything.
Thank you Syn for not finding my question maddening :)

I accept that they're a non profit organisation, but even then, they are HUGE and, as we have been told, their data is (or has been) relied upon in court. 

In order to preserve their integrity, wouldn't one expect (and have a right to expect) that if they have 'archived incorrectly' they have a duty to explain the error? If they don't, how can their data ever be relied upon in the future?

This is making a mockery of their entire raison d'etre.

Post by Suspicious Mind 08.07.15 8:46

Please excuse my ignorance when I ask what the debate is with regards to this Wayback site? I only joined this site yesterday and had minimal time to look through it. But, I noticed there seems to be quite a stir caused by the fact the Madeleine McCann case ended up on this Wayback site. What exactly is the issue, if any? What is Wayback? Can someone please enlighten this newcomer?

Post by Tony Bennett 08.07.15 8:56

Suspicious Mind wrote:Please excuse my ignorance when I ask what the debate is with regards to this Wayback site? I only joined this site yesterday and had minimal time to look through it. But, I noticed there seems to be quite a stir caused by the fact the Madeleine McCann case ended up on this Wayback site. What exactly is the issue, if any? What is Wayback? Can someone please enlighten this newcomer?
REPLY: Wayback Machine is a vast archive which preserves for posterity actions which take place on the internet - such as the creation and alteration of websites, and individual pages on those websites.

A Brit living in the U.S. called Steve Marsden said he found a record of CEOP - the Child Exploitation and Online Protection Centre - having created a page about Madeleine McCann before 11.58am on 30 April 2007. This, if true, would suggest that something bad happened to Madeleine before then and not on 3 May 2007 as the McCanns claim.

The view espoused especially by posters Nuala and Syn on this forum is that this was an unfortunate (but as yet unspecified) 'glitch' in Wayback's system, i.e. a mistake.

____________________

Dr Martin Roberts: "The evidence is that these are the pyjamas Madeleine wore on holiday in Praia da Luz. They were photographed and the photo handed to a press agency, who released it on 8 May, as the search for Madeleine continued. The McCanns held up these same pyjamas at two press conferences on 5 & 7 June 2007. How could Madeleine have been abducted?"

Amelie McCann (aged 2): "Maddie's jammies!".  


Post by rustyjames 08.07.15 9:19

suzysu wrote:
Syn wrote:
suzysu wrote:@Syn ".....All present and correct and NONE of them archived
on 30th April 2007 despite what the WB source directory claims."

I'm sure it's maddening to you, as a techie, to have to deal with non-tech questions, but don't you find it ODD that an organisation as otherwise-credible as WBM hasn't come out publicly with an explanation?
No not maddening at all suzysu :) And no I don't find it odd at all. They are a non profit organisation who do not have to explain anything to anyone.  They have had the decency to let us know that after further investigation the urls in question were archived incorrectly due to a subset issue which they are trying to resolve. They did not actually have to tell us anything.
Thank you Syn for not finding my question maddening :)

I accept that they're a non profit organisation, but even then, they are HUGE and, as we have been told, their data is (or has been) relied upon in court. 

In order to preserve their integrity, wouldn't one expect (and have a right to expect) that if they have 'archived incorrectly' they have a duty to explain the error? If they don't, how can their data ever be relied upon in the future?

This is making a mockery of their entire raison d'etre.

@suzysu - they are not huge.  According to Wikipedia they have about 200 employees, a large proportion of whom are involved in book scanning.  Their raison d'être is to create a digital library of cultural artifacts, to try to prevent this current era becoming a digital dark age.  See below from their "About" page:

Why the Archive is Building an 'Internet Library'

Libraries exist to preserve society's cultural artifacts and to provide access to them. If libraries are to continue to foster education and scholarship in this era of digital technology, it's essential for them to extend those functions into the digital world.
Many early movies were recycled to recover the silver in the film. The Library of Alexandria - an ancient center of learning containing a copy of every book in the world - was eventually burned to the ground. Even now, at the turn of the 21st century, no comprehensive archives of television or radio programs exist.
But without cultural artifacts, civilization has no memory and no mechanism to learn from its successes and failures. And paradoxically, with the explosion of the Internet, we live in what Danny Hillis has referred to as our "digital dark age."
The Internet Archive is working to prevent the Internet - a new medium with major historical significance - and other "born-digital" materials from disappearing into the past. Collaborating with institutions including the Library of Congress and the Smithsonian, we are working to preserve a record for generations to come.
Open and free access to literature and other writings has long been considered essential to education and to the maintenance of an open society. Public and philanthropic enterprises have supported it through the ages.
The Internet Archive is opening its collections to researchers, historians, and scholars. The Archive has no vested interest in the discoveries of the users of its collections, nor is it a grant-making organization.
At present, the size of our Web collection is such that using it requires programming skills. However, we are hopeful about the development of tools and methods that will give the general public easy and meaningful access to our collective history. In addition to developing our own collections, we are working to promote the formation of other Internet libraries in the United States and elsewhere.

Post by rustyjames 08.07.15 9:27

Syn wrote:
suzysu wrote:@Syn ".....All present and correct and NONE of them archived
on 30th April 2007 despite what the WB source directory claims."

I'm sure it's maddening to you, as a techie, to have to deal with non-tech questions, but don't you find it ODD that an organisation as otherwise-credible as WBM hasn't come out publicly with an explanation?
No not maddening at all suzysu :) And no I don't find it odd at all. They are a non profit organisation who do not have to explain anything to anyone.  They have had the decency to let us know that after further investigation the urls in question were archived incorrectly due to a subset issue which they are trying to resolve. They did not actually have to tell us anything.

Syn - I generally agree with most of your posts, but can you explain what you consider a "subset issue" to be as my view is the continued use of the terminology "subset" is a case of Chinese whispers.

For reference my take on it in response to Tony where he quoted the manual section you'd highlighted is here - [You must be registered and logged in to see this link.]

Post by Rufus T 08.07.15 10:18

Posted with HKP's permission from MMM:


+++++++++++++++++



QUOTE

Here’s a dilemma. In looking at the captures, something else jumps out. Please read the extract from Wikipedia below, which will hopefully be self-explanatory; note that the highlighting is mine.

QUOTE WIKIPEDIA

Robots Exclusion Standard

The robots exclusion standard, also known as the robots exclusion protocol or robots.txt protocol, is a standard used by websites to communicate with web crawlers and other web robots. The standard specifies the instruction format to be used to inform the robot about which areas of the website should not be processed or scanned. Robots are often used by search engines to categorize and archive web sites, or by webmasters to proofread source code. Not all robots cooperate with the standard including email harvesters, spambots and malware robots that scan for security vulnerabilities. The standard is different from, but can be used in conjunction with Sitemaps, a robot inclusion standard for websites.

When a site owner wishes to give instructions to web robots they place a text file called robots.txt in the root of the web site hierarchy (e.g. [You must be registered and logged in to see this link.]). This text file contains the instructions in a specific format (see examples below). Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the web site. If this file doesn't exist, web robots assume that the web owner wishes to provide no specific instructions, and crawl the entire site.

A robots.txt file on a website will function as a request that specified robots ignore specified files or directories when crawling a site. This might be, for example, out of a preference for privacy from search engine results, or the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or out of a desire that an application only operate on certain data. Links to pages listed in robots.txt can still appear in search results if they are linked to from a page that is crawled.


UNQUOTE WIKIPEDIA

Applying this standard to the CEOP captures gives some very interesting results. In April 07 there were 102 robots.txt URLs captured, at least one for every day crawled apart from the 30th. Some days had more (75 on the 25th, for some reason) and others had a single capture, like the 26th and 28th. It should be noted that not every day was crawled: 15 days in total, including the 30th. Given what we have read above, the 30th needs to be looked at.

30/04/07: no robots.txt URLs captured.


Given the above statement that ‘if this file doesn't exist, web robots assume that the web owner wishes to provide no specific instructions, and crawl the entire site’, is this what happened, and was the entire site crawled, picking up mccann.html and the madeleine 01 & 02 jpgs? Obviously there are still question marks around captures with future dates that still need to be explained.

I'd appreciate Resistor's opinion and a post onto CMOMM.


UNQUOTE

[Post re-formatted, and edited for clarity, by a Mod]

 
Hongkong Phooey

Post by Suspicious Mind 08.07.15 14:07

Tony Bennett wrote:
Suspicious Mind wrote:Please excuse my ignorance when I ask what the debate is with regards to this Wayback site? I only joined this site yesterday and had minimal time to look through it. But, I noticed there seems to be quite a stir caused by the fact the Madeleine McCann case ended up on this Wayback site. What exactly is the issue, if any? What is Wayback? Can someone please enlighten this newcomer?
REPLY: Wayback Machine is a vast archive which preserves for posterity actions which take place on the internet - such as the creation and alteration of websites, and individual pages on those websites.

A Brit living in the U.S. called Steve Marsden said he found a record of CEOP - the Child Exploitation and Online Protection Centre - having created a page about Madeleine McCann before 11.58am on 30 April 2007. This, if true, would suggest that something bad happened to Madeleine before then and not on 3 May 2007 as the McCanns claim.

The view espoused especially by posters Nuala and Syn on this forum is that this was an unfortunate (but as yet unspecified) 'glitch' in Wayback's system, i.e. a mistake.

Thanks for the reply Tony!

That machine must take some looking into as I would guess there are a lot of alterations on a daily basis? Even so, if it is true that this guy found such a page, it is a pretty scary thought, unless it was a glitch as thought by the posters you mentioned. If it is true, it makes you wonder what the hell is going on with regards to this family. It's a bit like the Jane Standley news bulletin from New York where she reports WTC 7 having gone down yet it is still standing in the background as she reports live on the BBC. Maybe there are a lot of people out there having premonitions.

Post by TheTruthWillOut 08.07.15 15:27

To give a crude and basic example of how computers can seem to do odd things.

When I'm logged out of this forum the timings on posts are in the 12 hour format. I guess the server/computer the forum sits on has its clock set in 12 hour format.

When I log in, the posts change to 24 hour format. I guess the forum then takes the time format from the forum profile settings or from my PC clock setting (24h).

Of course this Wayback issue is vastly more complex but I think ultimately it probably is an error. I think what makes a lot here more angry is that Wayback aren't obligated to explain the issue. Maybe they might have to reveal flaws/limitations in their crawler program to do so?

They wouldn't really want to do that if they can help it. I'm sure the code or 'recipe' of the crawler program is a closely guarded secret like the Coca-Cola recipe.
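
The 12 hour versus 24 hour example above can be sketched in a few lines of Python. This is only an illustration of one stored time being displayed in two ways, not the forum software's actual code; the time used is the one shown on this post, and the function names are made up.

```python
from datetime import datetime

# One stored moment in time: the timestamp shown on this post (08.07.15 15:27).
posted = datetime(2015, 7, 8, 15, 27)


def as_12_hour(moment: datetime) -> str:
    """Format the time the way a 12 hour clock setting would display it."""
    return moment.strftime("%I:%M %p")


def as_24_hour(moment: datetime) -> str:
    """Format the same time the way a 24 hour clock setting would display it."""
    return moment.strftime("%H:%M")


print(as_12_hour(posted))
print(as_24_hour(posted))
```

The stored value never changes; only the display setting does, which is why the same post can look different depending on who is viewing it.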

Post by Nuala 08.07.15 20:19

@ HKP via Rufus T

Given the above statement of ‘if this file doesn't exist web robots assume the web owner wishes to provide no specific instructions, and crawls the entire site’ is this what happened and the entire site was crawled picking up mccann.html & madeleine 01 & 02 jpgs???

This is the robots.txt file for the CEOP website as archived on 29th April at 14:15:59:

User-agent: *
Disallow: /images/
Disallow: /pdfs/
Disallow: /role_profiles/

Nothing about excluding mccann.html there.

Also, just because the robots.txt wasn't crawled on 30 Apr 2007 doesn't mean it wasn't there. It would have been there on 30 Apr 2007, just not crawled on that date.

Note also:

1) robots.txt exclusion requests are just that, only requests. A robots.txt doesn't actually stop a crawler from crawling certain things, it is just a request that they don't, so anyone wanting to hide anything wouldn't upload it and use a robots.txt to exclude it from crawlers.

2) robots.txt files are public, anyone can see them, all they have to do is enter the URL [You must be registered and logged in to see this link.] to view the file. So a robots.txt would not be used to "hide" mccann.html because it wouldn't actually hide it.
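
The robots.txt rules quoted above can be tested with Python's standard urllib.robotparser module. This is a minimal sketch: it assumes mccann.html sat at the root of the site, and the image path is a made-up example. It shows only what a crawler that chooses to obey the file would be asked to do, not what Wayback's crawler actually did.

```python
from urllib.robotparser import RobotFileParser

# The rules exactly as quoted in the post above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
Disallow: /pdfs/
Disallow: /role_profiles/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path: str) -> bool:
    """Return True if a rule-following crawler may fetch this path."""
    return parser.can_fetch("*", path)


print(allowed("/mccann.html"))         # assumed location of the page
print(allowed("/images/example.jpg"))  # hypothetical file under /images/
```

Under these rules the page itself is not excluded, while anything under /images/ is covered by a Disallow request.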

Post by Syn 08.07.15 20:26

suzysu wrote:
Syn wrote:
suzysu wrote:@Syn ".....All present and correct and NONE of them archived
on 30th April 2007 despite what the WB source directory claims."

I'm sure it's maddening to you, as a techie, to have to deal with non-tech questions, but don't you find it ODD that an organisation as otherwise-credible as WBM hasn't come out publicly with an explanation?
No not maddening at all suzysu :) And no I don't find it odd at all. They are a non profit organisation who do not have to explain anything to anyone.  They have had the decency to let us know that after further investigation the urls in question were archived incorrectly due to a subset issue which they are trying to resolve. They did not actually have to tell us anything.
Thank you Syn for not finding my question maddening :)

I accept that they're a non profit organisation, but even then, they are HUGE and, as we have been told, their data is (or has been) relied upon in court. 

In order to preserve their integrity, wouldn't one expect (and have a right to expect) that if they have 'archived incorrectly' they have a duty to explain the error? If they don't, how can their data ever be relied upon in the future?

This is making a mockery of their entire raison d'etre.

Very welcome suzysu :)  Always happy to try and answer any questions no matter how non techie :)  There is a lot about this subject that I do not fully understand myself too :)

In answer to your other questions, I see RustyJames has already kindly responded and explained archive.org's raison d'etre better than I could :)

Post by Syn 08.07.15 20:38

rustyjames wrote:
Syn wrote:
suzysu wrote:@Syn ".....All present and correct and NONE of them archived
on 30th April 2007 despite what the WB source directory claims."

I'm sure it's maddening to you, as a techie, to have to deal with non-tech questions, but don't you find it ODD that an organisation as otherwise-credible as WBM hasn't come out publicly with an explanation?
No not maddening at all suzysu :) And no I don't find it odd at all. They are a non profit organisation who do not have to explain anything to anyone.  They have had the decency to let us know that after further investigation the urls in question were archived incorrectly due to a subset issue which they are trying to resolve. They did not actually have to tell us anything.

Syn - I generally agree with most of your posts, but can you explain what you consider a "subset issue" to be as my view is the continued use of the terminology "subset" is a case of Chinese whispers.

For reference my take on it in response to Tony where he quoted the manual section you'd highlighted is here - [You must be registered and logged in to see this link.]

I think you are right that what I posted was to do with replay mode. I have recently written to archive.org (not mentioning the CEOP pages at all) to ask about something I found which suggests that timestamp issues have been encountered when subsets of ARC data are repackaged into (W)ARC files and, in some cases, back to ARC.  I am hoping that they reply.

This is what led me down the route of asking the question, which may or may not lead to anything.

It is lengthy, I'm afraid, but I have a feeling that it is something you would understand, as I am sure you have mentioned ARC/WARC previously:

WARC spec clarification on transformed WARCs



[You must be registered and logged in to see this link.]
14/01/2009

Other recipients: [You must be registered and logged in to see this link.]

hi WARC Tools,
can you please clarify the WARC spec with regard to the
WARC-Date field (part 1) and warcinfo records in WARCs
transformed from ARCs (part 2) for us? these issues came up when
comparing Heritrix (2.0.2) and warc tools (r242) arc2warc
output.

----------------------------------------------------------------
part 1
----------------------------------------------------------------
according to the WARC spec[1] ISO/DIS 28500 (v0.18):

   5.4 WARC-Date
   "The timestamp shall represent the instant that
   data capture for record creation began."

this may mean that the creation date of the WARC file itself
(from an original ARC) would not be captured. also, WARC files
converted from ARCs which predate the WARC format might have a
WARC-Date field which predates the WARC format.

is this what we want?

this issue came up when comparing the output of:

   1) Heritrix's Arc2Warc.java class, and
   2) WARC Tools' arc2warc

given an ARC file whose date is:

   2008-12-19 23:22:43

converting the ARC to a WARC with Heritrix gives:

   WARC-Date: 2009-01-05T22:25:39Z

in the first record (a warcinfo record), which is the creation
date.

while converting to a WARC with warc tools gives:

   WARC-Date: 2008-12-19T23:22:43Z

in the first record (which is a response record - see part 2).

so, do we want the WARC-Date field in the warcinfo record
to be the date of the first record, or the creation date
of the WARC file itself?

attachments:

arc2warc-arc.txt: head of Original ARC file
arc2warc-h2.txt : head of WARC from Heritrix's Arc2Warc.java
arc2warc-wt.txt : head of WARC from WARC tools arc2warc
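
The two date formats in part 1 can be compared with a short Python sketch. It assumes the ARC date is in UTC and simply rewrites it in the WARC-Date style shown in the email. It is not the code of either arc2warc tool, and the function name is made up for illustration.

```python
from datetime import datetime, timezone


def arc_date_to_warc_date(arc_date: str) -> str:
    """Rewrite an ARC-style date as a WARC-Date string, assuming UTC."""
    parsed = datetime.strptime(arc_date, "%Y-%m-%d %H:%M:%S")
    parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


# The ARC date quoted in the email.
print(arc_date_to_warc_date("2008-12-19 23:22:43"))
```

Carrying the original capture date across like this matches the WARC Tools output quoted in the email, whereas the Heritrix output quoted there carries the later conversion date instead.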

----------------------------------------------------------------
part 2:
----------------------------------------------------------------
even more conspicuously, the warc tools transformed WARC gives
the first record as type:

   WARC-Type: response

with a target URI of:

   WARC-Target-URI: [You must be registered and logged in to see this link.]

which yields a significantly different record than the Heritrix
transformed WARC, which gives a 'warcinfo' record as the initial
record of the transformed WARC file. (see attachments)

furthermore, the WARC spec states in section "4 File and record
model":

   All 'warcinfo' 'request', 'metadata' and 'revisit'
   records shall not have a payload.

but Heritrix's Arc2Warc class outputs a warcinfo record that has
a "Filedesc:" payload.


please let us know what you think of these differences so we can
determine how best to converge.


thanks,
/st...@archive.org


[1] [You must be registered and logged in to see this link.]



Gordon Paynter
19/01/2009

Other recipients: [You must be registered and logged in to see this link.], [You must be registered and logged in to see this link.]

Hi Steve: 

While I cannot answer your questions myself, I did send them to Clement 
at BNF, who made the following response (which I hope he will not mind 
my sharing). I hope you find it useful. 

Gordon 


Hi Gordon, 

I send you few comments on the questions on WARC (Part 2 precedes Part 
1) 

Part 2: 

As far as I know, the Warcinfo record has been designed to play the 
role of the "filedesc" of the ARC format. 
However, the Warcinfo record of a migrated WARC file shall describe the 
migration process (and it is not possible to have two warcinfo records 
within the same WARC file). 

On the other hand, an ARC filedesc record can't be considered as a real 
"response", so it shall not be migrated in a WARC "response" record. 

A solution may be to create a Warcinfo record describing a migration 
process, 
AND 
to create a metadata record containing the content of the ARC filedesc
record. 

On the question of the payload: 
The payload in the WARC standard is defined as a "Data object referred 
to, or contained by a WARC record as a meaningful subset of the content 
block" (p. 3). 

Defining a "meaningful subset" is useful, because one could want to 
check data integrity of the payload (that is the file harvested on the 
Net, without http responses), or identify its format. 

In the Warcinfo record given as an example of the output of Heritrix's 
ARC2WARC class, the text written after the headers seems to be only the 
block of the record, so there is no inconsistency with the standard. 

Part 1: 

It seems to be a very critical issue. 

To my opinion, a WARC response record migrated from a ARC record shall 
have the same date than the previous ARC record. 
That is: 
 a ARC record whose date is 2008-12-19 23:22:43 
shall be migrated in a response record with WARC-Date: 
2008-12-19T23:22:43Z 

On the other hand, the migrated WARC response record should be linked 
to the Warcinfo record describing the migration process, whose date 
should be WARC-Date: 2009-01-05T22:25:39Z 

The date of the metadata record containing the "filedesc" shall also be 
2009-01-05T22:25:39Z, but it will be necessary to put the original date
of the ARC filedesc record somewhere else in the WARC metadata record. 

This solution allows to record: 
- the original harvest date 
- the migration date 
- and it seems a good solution for access tools such as Wayback 
Machine 

It has three shortcomings: 
- this solution is not formally written in the standard (but the 
standard gives no rule to manage migrated WARC files) 
- the WARC response record dates predate the WARC format (but it is not 
a real problem, to my opinion) 
- it is not very consistent with the way we shall treat conversion 
records (they shall have the WARC date of their creation, not of the 
creation of the original WARC record, see the example in the standard p. 
24). 

-... but it seems to me the best solution! 

I hope these few ideas will be useful, please say me what are your 
opinion on these topics. 

Clément 

- - - - - - - - - - 
Clément Oury 
Digital Curator 
Digital Legal Deposit 

Bibliothèque nationale de France 
Quai François-Mauriac 
75706 Paris Cedex 13 
tel. 33 (0)1 53 79 46 93 


>>> "st...@archive.org" 15/01/09 9:42 a.m. >>> 
 

siznax
28/01/2009


Gordon and Clement, 

thanks for your thoughtful response. 

your suggestions sound perfectly reasonable. i'll try 
to restate them below so that you can confirm that we 
have reached a consensus. 

given the following WARC states, the following conditions 
should apply: 

1) original WARC 
  warcinfo record should serve as ARC "filedesc" record, 
  with optional WARC generation 

2) migrated WARC (ARC->WARC) 
  a) warcinfo record should serve as migration description, 
     warcinfo/WARC-Date should be migrated WARC creation date 
  b) metadata record should contain content of ARC "filedesc" 
     record, metadata/WARC-Date should be migrated WARC creation 
     date, ARC "filedesc" date should also be in this record, 
     and possibly the WARC generation could be indicated here 
  c) response records should have the same date as each 
     corresponding ARC record 

3) second-generation WARC (WARC->ARC->WARC) 
  a) same conditions as (2), and 
  b) warcinfo record should indicate WARC generation 

i believe we would need to agree then on the form of the 
fields for: 

  2b) original ARC "filedesc" date in migrated WARC metadata 
    record, e.g. metadata/"ARC-Filedesc-Date" with ISO8601 date. 

  1,2b,3b) WARC generation specified in warcinfo record, 
    e.g. warcinfo/"WARC-Generation" with integer value 
    indicating; 0=original WARC, 1=migrated WARC, 
    2=second-generation WARC, etc. 

i'm not sure if "WARC-Generation" is necessary, but it seems 
potentially useful. 


thanks so much, 
/st...@archive.org 

[You must be registered and logged in to see this link.]

There's a myriad of info on Heritrix and on ARC -> WARC -> ARC conversion here, but you have to sign up: https://webarchive.jira.com/wiki/pages/viewpage.action?pageId=4865
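For anyone who wants to see the date mapping from the quoted thread spelled out, here is a rough Python sketch. It is only an illustration of what Clément and siznax proposed, not how the Internet Archive's tools actually behave. The function names are made up, "ARC-Filedesc-Date" is the field name siznax merely suggested, the 14-digit ARC date is assumed to be UTC, and the filedesc date passed in the example is a placeholder because the thread does not give one.

```python
from datetime import datetime, timezone


def arc_date_to_warc_date(arc_date: str) -> str:
    """Turn a 14-digit ARC date (YYYYMMDDhhmmss, assumed UTC) into a WARC-Date string."""
    parsed = datetime.strptime(arc_date, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


def migrated_record_dates(arc_record_date: str, arc_filedesc_date: str, migration_date: str) -> dict:
    """Dates each record type would carry under the proposal in the quoted thread."""
    return {
        # warcinfo describes the migration itself, so it carries the migration date
        "warcinfo": {"WARC-Date": migration_date},
        # metadata holds the old ARC filedesc content plus its original date
        # ("ARC-Filedesc-Date" is only a suggested, hypothetical field name)
        "metadata": {
            "WARC-Date": migration_date,
            "ARC-Filedesc-Date": arc_date_to_warc_date(arc_filedesc_date),
        },
        # response records keep the original harvest date from the ARC record
        "response": {"WARC-Date": arc_date_to_warc_date(arc_record_date)},
    }


dates = migrated_record_dates(
    arc_record_date="20081219232243",     # the example date from Clément's reply
    arc_filedesc_date="20081219232243",   # placeholder only, not given in the thread
    migration_date="2009-01-05T22:25:39Z",
)
for record_type, headers in dates.items():
    print(record_type, headers)
```

The point is that, under this proposal, the harvest date and the migration date live in different records, so a tool reading the wrong record could report the wrong date.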

Post by Guest 08.07.15 21:57

Nuala wrote:@ HKP via Rufus T

Given the above statement of ‘if this file doesn't exist web robots assume the web owner wishes to provide no specific instructions, and crawls the entire site’ is this what happened and the entire site was crawled picking up mccann.html & madeleine 01 & 02 jpgs???

This is the robots.txt file for the CEOP website as archived on 29th April at 14:15:59:

User-agent: *
Disallow: /images/
Disallow: /pdfs/
Disallow: /role_profiles/

Nothing about excluding mccann.html there.

Also, just because the robots.txt wasn't crawled on 30 Apr 2007 doesn't mean it wasn't there. It would have been there on 30 Apr 2007, just not crawled on that date.

Note also:

1) robots.txt exclusion requests are just that, only requests. A robots.txt doesn't actually stop a crawler from crawling certain things, it's just a request that they don't, so anyone wanting to hide anything wouldn't upload it and use a robots.txt to exclude it from crawlers.

2) robots.txt files are public, anyone can see them, all they have to do is enter the URL [You must be registered and logged in to see this link.] to view the file. So a robots.txt would not be used to "hide" mccann.html because it wouldn't actually hide it.
I've registered so I can post on this thread.
Can you show us all the robots.txt for 30/04 rather than the 29/04?

Post by Rufus T 08.07.15 22:14

Good to see you HKP.

Post by rustyjames 08.07.15 22:17

Interesting Syn.  Yes I've said a few times I'd love to see analysis of the original .arc files - I would think they'd answer a lot of questions.

I've also wondered if they'd been migrated to .warc and whether there could be issues in that migration, but I would have thought they had that mapping of dates etc well defined prior to a migration.

It's a shame that warc wasn't used in 2007 as it records a lot of extra information and metadata.

Post by Guest 08.07.15 22:19

Rufus T wrote:Good to see you HKP.
Thanks Rufus T (for your help earlier as well)  big grin

Post by Syn 08.07.15 22:24

HKP wrote:
Nuala wrote:@ HKP via Rufus T

Given the above statement of ‘if this file doesn't exist web robots assume the web owner wishes to provide no specific instructions, and crawls the entire site’ is this what happened and the entire site was crawled picking up mccann.html & madeleine 01 & 02 jpgs???

This is the robots.txt file for the CEOP website as archived on 29th April at 14:15:59:

User-agent: *
Disallow: /images/
Disallow: /pdfs/
Disallow: /role_profiles/

Nothing about excluding mccann.html there.

Also, just because the robots.txt wasn't crawled on 30 Apr 2007 doesn't mean it wasn't there. It would have been there on 30 Apr 2007, just not crawled on that date.

Note also:

1) robots.txt exclusion requests are just that, only requests. A robots.txt doesn't actually stop a crawler from crawling certain things, it's just a request that they don't, so anyone wanting to hide anything wouldn't upload it and use a robots.txt to exclude it from crawlers.

2) robots.txt files are public, anyone can see them, all they have to do is enter the URL [You must be registered and logged in to see this link.] to view the file. So a robots.txt would not be used to "hide" mccann.html because it wouldn't actually hide it.
I've registered so I can post on this thread.
Can you show us all the robots.txt for 30/04 rather than the 29/04?
What part of "they have taken all the erroneous 30/04/2007 URLs out of the WB archive whilst they try to resolve this issue" do you not understand? Ergo, neither Nuala nor anyone else can provide what you ask, but it is safe to say it will be EXACTLY the same as it was for 29/04/2007.

Post by Syn 08.07.15 22:29

rustyjames wrote:Interesting Syn.  Yes I've said a few times I'd love to see analysis of the original .arc files - I would think they'd answer a lot of questions.

I've also wondered if they'd been migrated to .warc and whether there could be issues in that migration, but I would have thought they had that mapping of dates etc well defined prior to a migration.

It's a shame that warc wasn't used in 2007 as it records a lot of extra information and metadata.
I think it may be possible to look at the ARC files for dates in and around 30/04/2007 via the Atlassian Jira website I posted earlier. I am looking into it.

I agree, one would have thought that the data mapping would have been well defined, but the conversation on the Google Groups link suggests otherwise.

Yes, re 2007 and WARC: they took the timestamp to 17 digits and gleaned a lot more info, so if they then repackaged back to ARC and 14 digits, could that be where the errors occurred, I wonder?
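To make the 17-digit versus 14-digit question concrete, here is a small Python sketch. It rests on an assumption nobody in this thread has confirmed: that a 17-digit timestamp is the usual 14-digit YYYYMMDDhhmmss followed by three digits of milliseconds. The timestamp value is made up for illustration and is not a real capture.

```python
from datetime import datetime, timedelta


def parse_capture_timestamp(ts: str) -> datetime:
    """Parse a 14-digit timestamp, or a 17-digit one assumed to end in milliseconds."""
    if len(ts) not in (14, 17) or not ts.isdigit():
        raise ValueError(f"expected 14 or 17 digits, got {ts!r}")
    moment = datetime.strptime(ts[:14], "%Y%m%d%H%M%S")
    if len(ts) == 17:
        # Assumption: the three extra digits are milliseconds
        moment += timedelta(milliseconds=int(ts[14:]))
    return moment


def truncate_to_14(ts: str) -> str:
    """Drop everything after the seconds field."""
    return ts[:14]


long_ts = "20070430120000123"  # made-up example value, not a real capture time
short_ts = truncate_to_14(long_ts)
difference = parse_capture_timestamp(long_ts) - parse_capture_timestamp(short_ts)
print(short_ts, difference)
```

If that assumption holds, cutting 17 digits back to 14 only loses a fraction of a second and cannot by itself move a capture to a different day. That says nothing about other things that might go wrong in a conversion, such as a record picking up a date from the wrong header.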

Post by Guest 08.07.15 22:30

Nuala wrote:@ HKP via Rufus T

Given the above statement of ‘if this file doesn't exist web robots assume the web owner wishes to provide no specific instructions, and crawls the entire site’ is this what happened and the entire site was crawled picking up mccann.html & madeleine 01 & 02 jpgs???

This is the robots.txt file for the CEOP website as archived on 29th April at 14:15:59:

User-agent: *
Disallow: /images/
Disallow: /pdfs/
Disallow: /role_profiles/

Nothing about excluding mccann.html there.

Also, just because the robots.txt wasn't crawled on 30 Apr 2007 doesn't mean it wasn't there. It would have been there on 30 Apr 2007, just not crawled on that date.

Note also:

1) robots.txt exclusion requests are just that, only requests. A robots.txt doesn't actually stop a crawler from crawling certain things, it's just a request that they don't, so anyone wanting to hide anything wouldn't upload it and use a robots.txt to exclude it from crawlers.

2) robots.txt files are public, anyone can see them, all they have to do is enter the URL [You must be registered and logged in to see this link.] to view the file. So a robots.txt would not be used to "hide" mccann.html because it wouldn't actually hide it.
As a follow-up to my last question (robots.txt for 30/04/07), you would have thought that by capturing so many URLs (3876) it would have captured it at least once, but alas it captured mccann.html instead  big grin

Post by Guest 08.07.15 22:39

Syn wrote:
HKP wrote:
Nuala wrote:@ HKP via Rufus T

Given the above statement of ‘if this file doesn't exist web robots assume the web owner wishes to provide no specific instructions, and crawls the entire site’ is this what happened and the entire site was crawled picking up mccann.html & madeleine 01 & 02 jpgs???

This is the robots.txt file for the CEOP website as archived on 29th April at 14:15:59:

User-agent: *
Disallow: /images/
Disallow: /pdfs/
Disallow: /role_profiles/

Nothing about excluding mccann.html there.

Also, just because the robots.txt wasn't crawled on 30 Apr 2007 doesn't mean it wasn't there. It would have been there on 30 Apr 2007, just not crawled on that date.

Note also:

1) robots.txt exclusion requests are just that, only requests. A robots.txt doesn't actually stop a crawler from crawling certain things, it's just a request that they don't, so anyone wanting to hide anything wouldn't upload it and use a robots.txt to exclude it from crawlers.

2) robots.txt files are public, anyone can see them, all they have to do is enter the URL [You must be registered and logged in to see this link.] to view the file. So a robots.txt would not be used to "hide" mccann.html because it wouldn't actually hide it.
I've registered so I can post on this thread.
Can you show us all the robots.txt for 30/04 rather than the 29/04?
What part of "they have taken all the erroneous 30/04/2007 URLs out of the WB archive whilst they try to resolve this issue" do you not understand? Ergo, neither Nuala nor anyone else can provide what you ask, but it is safe to say it will be EXACTLY the same as it was for 29/04/2007.
Whoops, there goes that assumption again. It is not safe to say it will be EXACTLY the same, because in reality you don't know. Since you're playing the assumption card, let's assume that it didn't pick up a robots.txt and carried out a more rigorous sweep, picking up all sorts, maybe even mccann.html.

What is safe to say is that the records show 28/04 was nothing like 30/04.

Post by Nuala 08.07.15 22:41

@ HKP

Can you show us all the robots.txt for 30/04 rather than the 29/04?

The robots.txt file for 30 Apr isn't available.

But anyway it's irrelevant, because the point you made was that CEOP might have had a robots.txt that excluded mccann.html, which wasn't there on 30 Apr allowing mccann.html to be crawled.

As we can see on 29 Apr 2007 there was no exclusion request for mccann.html in the robots.txt anyway so the robots.txt not existing on 30 Apr would have made no difference.
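For the non-techies, the rules in that 29 Apr robots.txt can be checked with Python's standard urllib.robotparser module. This is only a sketch of how a well-behaved crawler would read those four lines. The two paths under /images/ and /pdfs/ are made-up examples, and nothing here shows what the Wayback crawler actually did.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content as quoted in this thread (archived 29 Apr 2007)
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
Disallow: /pdfs/
Disallow: /role_profiles/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Only /mccann.html comes from the thread; the other two paths are hypothetical
paths = ["/mccann.html", "/images/example.jpg", "/pdfs/example.pdf"]
results = {path: parser.can_fetch("*", path) for path in paths}

for path, allowed in results.items():
    print(path, "allowed" if allowed else "disallowed")
```

Under these rules /mccann.html is not excluded, while anything under /images/ or /pdfs/ is. Bear in mind this only models a crawler that chooses to honour robots.txt, since the file is a request and not a lock.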

As a follow-up to my last question (robots.txt for 30/04/07), you would have thought that by capturing so many URLs (3876) it would have captured it at least once, but alas it captured mccann.html instead

As the Wayback data for 30 Apr 2007 is screwed up, it might be that nothing was actually captured on 30 Apr 2007.

BTW, I note the big grin, and just to say this might be a game to you, but it isn't a game to me. We're talking here about the disappearance of a little girl and I'm not interested in people trying to score points.

Anyone really wanting to get to the truth of what happened to Madeleine McCann would debate it rationally and maturely, I would hope.

Post by Syn 08.07.15 22:44

Nuala wrote:@ HKP

Can you show us all the robots.txt for 30/04 rather than the 29/04?

The robots.txt file for 30 Apr isn't available.

But anyway it's irrelevant, because the point you made was that CEOP might have had a robots.txt that excluded mccann.html, which wasn't there on 30 Apr allowing mccann.html to be crawled.

As we can see on 29 Apr 2007 there was no exclusion request for mccann.html in the robots.txt anyway so the robots.txt not existing on 30 Apr would have made no difference.

As a follow-up to my last question (robots.txt for 30/04/07), you would have thought that by capturing so many URLs (3876) it would have captured it at least once, but alas it captured mccann.html instead

As the Wayback data for 30 Apr 2007 is screwed up, it might be that nothing was actually captured on 30 Apr 2007.

BTW, I note the big grin, and just to say this might be a game to you, but it isn't a game to me. We're talking here about the disappearance of a little girl and I'm not interested in people trying to score points.

Anyone really wanting to get to the truth of what happened to Madeleine McCann would debate it rationally and maturely, I would hope.
Well said, Nuala. Ditto here too.

Post by Nuala 08.07.15 22:49

Quoting @ Syn

safe to say it will be EXACTLY the same as it was for 29/04/2007

Of course it will. And the idea that someone uploaded mccann.html on 30 Apr, and also uploaded a new robots.txt to exclude that page is ridiculous.

If they wanted to keep mccann.html secret they just wouldn't have uploaded it.

You don't upload a page you want to keep secret and then try to keep it secret with a robots.txt, which is public (anyone can view it) and whose exclusion request might be ignored by any crawler anyway.

Crazy idea.