how to archive old reason threads please

16 Jan 2015

Thank you for explaining, wendylou!

I guess then I can't be a lot of help to this as I'm clearly not knowledgeable enough... :frown:

D.

littlejam · 17 Jan 2015

hello,

i'm out to due to a lack of knowledge on how to spider/archive stuff

i hope that some users will be able to cooperate and save the good stuff from the PUF

all is goodgood in the hood: loving this new forum: a breath of fresh air

cheers, j

MirEko · 17 Jan 2015

What about this?
http://www.tenmax.com/teleport/pro/home.htm

17 Jan 2015

Update on my spidering efforts:

Ok.. i've put another couple of hours into it. It was looking promising for a bit but i can't get the constraints right.
I was hoping i could automate the entire proces using httrack, but i think this isn't going to work.

As i've written before the amount of pages is staggering. Around 1.1 mill posts, so thats around 100k pages (because there's 10 posts per page)
At a process time of about 1 sec per page (which is probably optimistic) this whole job will take roughly 100.000/3600 => 27.78 hours.. Which doesn't leave much room for error.

Beware though that this number of 100k pages could easily double, if not triple if you don't get your filters right.

For example, I don't want to follow the following links:

to members
to individual posts (because that would add 1.1 mill downloads)
to reply pages
to external pages

Then there's the thing that the spider has to be logged in. This can be fixed by exporting cookies from a working browser session.
If at some point the site decides the session is no longer valid the entire process stops.. And with an entire run taking 27+ hours i don't want that.

Well, why not do just one subforum ? The problem is, if in any of the posts someone references another subforum i have to write a rule to not have httrack follow that link.... So to do this i have to write rules that take into account the page they find the link on. (something httrack can't do). If that process breaks out of the thread the risk is that it will start downloading the entire forum anyway..

The list goes on......
Its just way to many variables for me to handle

So...

My suggestion is that some people with more weight with propellerheads than me appeal to them and see if something can be worked out works for all of us.
In the mean time i suggest that if someone wants to save a thread to use the "show printable version" in the thread tools. You can find these in the top grey bar on the top post on a thread page. You can set the printable version to showing 40 posts a page. Cutting down the possible amount of saves you have to do by 75%.

wendylou · 17 Jan 2015

^ I concur - same problem I face with Sitesucker: it's very time consuming to effectively set up limitations on what not to crawl and then download the rest. No doubt I could download everything successfully, but can it be browsed locally without the login breaking? So far, I have not found a way to bypass login after download. Otherwise it's pointless to download everything as it would take many hours if not days.

Eagleizer · 17 Jan 2015

wendylou wrote:^ I concur - same problem I face with Sitesucker: it's very time consuming to effectively set up limitations on what not to crawl and then download the rest. No doubt I could download everything successfully, but can it be browsed locally without the login breaking? So far, I have not found a way to bypass login after download. Otherwise it's pointless to download everything as it would take many hours if not days.

Would it take other logins than yours you think?
I mean, could any member login maybe?

wendylou · 17 Jan 2015

The downloaded version still requires login via Propellerheads since the downloaded site is not functioning quite the same because it's not hosting the php database like the real forum. The only way I see to find out what can be done, after the fact, is download the entire site to get every freakin' web file and only then see about disabling the scripts afterwards, but that's a huge gamble because it could take many days to download it all, and maybe only to find out it can't be breached. I'm reluctant to try unless I can confirm in advance that it can be breached, but I'm not sure I can answer that question unless I download everything first, including all javascripts. Even then, if some of this requires a functional php database to operate properly, then I will have wasted days downloading non functional content. Ugh... it's a Catch-22. And even if we could log in via Propellerhead's code to look at the local copy, the Prop's are soon to replace their website with a new one, thus their login code for the forum goes offline.

17 Jan 2015

wendylou wrote:^ I concur - same problem I face with Sitesucker: it's very time consuming to effectively set up limitations on what not to crawl and then download the rest. No doubt I could download everything successfully, but can it be browsed locally without the login breaking? So far, I have not found a way to bypass login after download. Otherwise it's pointless to download everything as it would take many hours if not days.

Eagleizer wrote:
Would it take other logins than yours you think?
I mean, could any member login maybe?

No, it will never take logins. The site will be down, what you download should therefor be a static representation.

wendylou wrote:The downloaded version still requires login via Propellerheads since the downloaded site is not functioning quite the same because it's not hosting the php database like the real forum. The only way I see to find out what can be done, after the fact, is download the entire site to get every freakin' web file and only then see about disabling the scripts afterwards, but that's a huge gamble because it could take many days to download it all, and maybe only to find out it can't be breached. I'm reluctant to try unless I can confirm in advance that it can be breached, but I'm not sure I can answer that question unless I download everything first, including all javascripts. Even then, if some of this requires a functional php database to operate properly, then I will have wasted days downloading non functional content.
Ugh... it's a Catch-22. And even if we could log in via Propellerhead's code to look at the local copy, the Prop's are soon to replace their website with a new one, thus their login code for the forum goes offline.

To be able to have a local copy, it has to be downloaded first. Every single page of it. This requires your spider to be authenticated in order for it to be allowed to download.
If your local copy contains login screens then this means it wasn't allowed to download the content it was after.
Logging is not possible because props site with code and database is down...
Now lets say that because Magic you are able to login... then your local copy still doesn't contain the page behind that because as i said earlier, a login page in your local copy means it didn't download the data it was actually after...

So you'll never get in the situation to be able to have the entire site locally copied AND be stuck with a login page....

I'm sorry.. i'm not very good at explain this

17 Jan 2015

Sounds like the PX7 200k ReFill project...So many redundancies to get rid of =P

eauhm wrote:About the spider.. I'm working on setting it up. Its a bit tougher then just throwing in a url... (ofcourse)
I think i have authentication done via some cookie magic so now its on to the exclude list.

I thought i'd exclude the following fora:
- Post your Music
- Music Forum
- Mobile Apps Forum

But that still leaves me with a grand total of about 1.1 million posts. At 10 posts a page this may be a bit much. So i'm not quite sure if i want to do that.
I've been looking at ways to download more posts at once. The max i can get is 40 via the print view. Unfortunately, coaxing the page to give more doesn't work.

More on this later..

[edit]

Maybe we can appeal to the props to give us the DB for the posts.
personally i can't stand it to throw away a good source of information. Feels like burning a library. You just don't do that :S

littlejam · 17 Jan 2015

hello,

@ wendylou: is there anyway to give us less smart people directions on how to archive our stuff

my intent is to navigate to my user account name, click on 'all posts started by littlejam'
and then try to get all of that stuff

any suggestions, advice, help for the rest of us is appreciated

i just saw that sitesucker appears to be a mac program

i use a pc, win 7 (and palemoon browser (which is like mozilla firefox))

thanks for your help

:s0826: , j

17 Jan 2015

joeyluck wrote:Sounds like the PX7 200k ReFill project...So many redundancies to get rid of =P

Meant to get back to you on that one... I found a file with most of the collected sounds and deleted the duplicates... But I still have more than 22,000 patches...

Will take me a lot of time to sort through those!

D.

wendylou · 17 Jan 2015

OK, I gave this another crack and I'm having success downloading the entire forum without restrictions or logins required. It's gonna take awhile, as everything get's converted to a static, local website. :s0826:

17 Jan 2015

wendylou wrote:OK, I gave this another crack and I'm having success downloading the entire forum without restrictions or logins required. It's gonna take awhile, as everything get's converted to a static, local website.

Hurray!!!!

Thank you for your efforts!

D.

17 Jan 2015

wendylou wrote:OK, I gave this another crack and I'm having success downloading the entire forum without restrictions or logins required. It's gonna take awhile, as everything get's converted to a static, local website. :s0826:

Awesome job, i admire your perseverance ! I'm curious how long the entire job is going to take, its quite a huge task

wendylou · 17 Jan 2015

Well let me do the math! 4700 files downloaded and 20,330 and incrementing, and I'm only on level 4. This is a huge land grab.

The trick was to set it to ignore robots.txt file and ignore logins. Unfortunately my filters are not tuned right and I'm getting other content, but some can be kept like tutorials, free refills and patches, etc. It's even grabbing our avatar pics.

littlejam · 17 Jan 2015

hello,

@ wendylou: you most rock!!!

:s0826: , j

17 Jan 2015

wendylou wrote:Well let me do the math! 4700 files downloaded and 20,330 and incrementing, and I'm only on level 4. This is a huge land grab.

The trick was to set it to ignore robots.txt file and ignore logins. Unfortunately my filters are not tuned right and I'm getting other content, but some can be kept like tutorials, free refills and patches, etc. It's even grabbing our avatar pics.

I hope you don't mind me asking this, but are you downloading anonymously or as a certain user ?
Did you manually filter out certain pages like "showpost.php" and "newreply.php" or are you just restricting spanning to other hosts ?

Thanks !

Julibee · 17 Jan 2015

wendylou wrote:Well let me do the math! 4700 files downloaded and 20,330 and incrementing, and I'm only on level 4. This is a huge land grab.

The trick was to set it to ignore robots.txt file and ignore logins. Unfortunately my filters are not tuned right and I'm getting other content, but some can be kept like tutorials, free refills and patches, etc. It's even grabbing our avatar pics.

This is amazing. Thank you for trying, Wendy!!!

wendylou · 17 Jan 2015

@julibee, thanks, i sure hope it's worth it. I'll know when it's all done.

@eauhm, I'm not logged in as a user. This is Sitesucker and I believe it's just a nice GUI for command line "wget". Yeah, I am getting it all. showpost.php is the most interesting but it's also grabbing forumdisplay.php, misc.php, member.php, et al. It will be fully browsable as an offline site with no calls to external files, assuming I get them all.

Moo ha ha, I feel like Dr. Evil! World Control or bust!

On another note, although I can understand the Props wanting to get out of the forum business, especially with the way it's been going, it sort of feels like the burning down of the Library of Alexandria, not that anything we have contributed is as profound as Aristotle, Plato or anyone

17 Jan 2015

wendylou wrote:@julibee, thanks, i sure hope it's worth it. I'll know when it's all done.

@eauhm, I'm not logged in as a user. This is Sitesucker and I believe it's just a nice GUI for command line "wget". Yeah, I am getting it all. showpost.php is the most interesting but it's also grabbing forumdisplay.php, misc.php, member.php, et al. It will be fully browsable as an offline site with no calls to external files, assuming I get them all.

Moo ha ha, I feel like Dr. Evil! World Control or bust!

Ah yeah, wget is what i started with. Be carefull to not interrupt the job then, wget is very bad at resuming a task like this. Godspeed

17 Jan 2015

wendylou wrote:@julibee, thanks, i sure hope it's worth it. I'll know when it's all done.

@eauhm, I'm not logged in as a user. This is Sitesucker and I believe it's just a nice GUI for command line "wget". Yeah, I am getting it all. showpost.php is the most interesting but it's also grabbing forumdisplay.php, misc.php, member.php, et al. It will be fully browsable as an offline site with no calls to external files, assuming I get them all.

Moo ha ha, I feel like Dr. Evil! World Control or bust!

On another note, although I can understand the Props wanting to get out of the forum business, especially with the way it's been going, it sort of feels like the burning down of the Library of Alexandria, not that anything we have contributed is as profound as Aristotle, Plato or anyone

Wow, great job Wendy! All is not lost I suppose :t2018:

17 Jan 2015

wendylou wrote:On another note, although I can understand the Props wanting to get out of the forum business, especially with the way it's been going, it sort of feels like the burning down of the Library of Alexandria, not that anything we have contributed is as profound as Aristotle, Plato or anyone

This is scary. The library of alexandria is exactly what i thought of. Ever since i learned what happened there i cannot help but think of it when i hear of valuable sources information being destroyed.. I cant even begin to imagine what kind of impact that knowledge couldve still had hadn't we lost it...

17 Jan 2015

So who is going to host all of this information? With such a large amount of data, it's bound to be expensive.

Also, it can't be posted online with the Propellerhead branding, so the information and formatting will need to be stripped from the pages and created anew. I can't think of an automated process for this (but I'm not a web developer)

You might be able to get a torrent set up for people to d/l if they want it, but would need to consider Propellerhead's IP rights over the information.

wendylou · 17 Jan 2015

Not sure what the best way to handle the archive. Keeping my download offline for occasional extraction requests is one option. Regarding hosting it, I'm not going to do it. Regarding branding, very easy to delete any shared forum images that brand it. The Wayback Machine archives web page but only displays images if the path is still live, otherwise it's missing images; if I delete the shared image folder, same thing; only textual pages will remain, not branding. Most of these html pages are 58k, not much of a footprint. But collectively it could be several gigs. Saving it is my only goal now, we can figure out who wants what later. Might I suggest we extract tutorials, combinators, sage advice from the likes of Selig and similar, and just forget the fluff and noise? If we did it that way, I can sen people zips of their stuff they thought was important to share with others. But if this ends up being functionally intact and I delete the branding, if someone wants it to host, I'm hoping I can burn to DVD-ROM(s).

3rd Floor Sound · 17 Jan 2015

First let me say you people are wonderful.
as far as hosting the content, I believe copy.com has a free account of 20gb as opposed to dropbox's 2gb

how to archive old reason threads please

Who is online